Guaranteeing availability of target data to remote initiators via a hybrid source/target credit scheme

ABSTRACT

A device includes a converged input/output controller that includes a physical target storage media controller, a physical network interface controller and a gateway between the storage media controller and the network interface controller, wherein the gateway provides a direct connection for storage traffic and network traffic between the storage media controller and the network interface controller.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. Ser. No. 14/640,717, filed Mar. 6, 2015 (DWIS-0004-U01) and entitled “METHODS AND SYSTEMS FOR CONVERGED NETWORKING AND STORAGE”, which is hereby incorporated by reference in its entirety.

U.S. Ser. No. 14/640,717 claims the benefit of the following provisional applications, each of which is hereby incorporated by reference in its entirety: U.S. patent application 61/950,036, filed Mar. 8, 2014 (DWIS-0002-P01) and entitled “METHOD AND APPARATUS FOR APPLICATION DRIVEN STORAGE ACCESS”; and U.S. patent application 62/017,257, filed Jun. 26, 2014 (DWIS-0003-P01) and entitled “APPARATUS FOR VIRTUALIZED CLUSTER IO”.

FIELD OF THE INVENTION

This application relates to the fields of networking and data storage, and more particularly to the field of converged networking and data storage devices.

BACKGROUND OF THE INVENTION

The proliferation of scale-out applications has led to very significant challenges for enterprises that use such applications. Enterprises typically choose between solutions like virtual machines (involving software components like hypervisors and premium hardware components) and so-called “bare metal” solutions (typically involving use of an operating system like Linux™ and commodity hardware). At large scale, virtual machine solutions typically have poor input-output (IO) performance, inadequate memory, inconsistent performance, and high infrastructure cost. Bare metal solutions typically have static resource allocation (making changes in resources difficult and resulting in inefficient use of the hardware), challenges in planning capacity, inconsistent performance, and operational complexity. In both cases, inconsistent performance characterizes the existing solutions. A need exists for solutions that provide high performance in multi-tenant deployments, that can handle dynamic resource allocation, and that can use commodity hardware with a high degree of utilization.

FIG. 1 depicts the general architecture of a computing system 102, such as a server, functions and modules of which may be involved in certain embodiments disclosed herein. Storage functions (such as access to local storage devices on the server 102, such as media 104 (e.g., rotating media or flash)) and network functions such as forwarding have traditionally been performed separately in either software stacks or hardware devices (e.g., involving a network interface controller 118 or a storage controller 112, for network functions or storage functions, respectively). Within an operating system stack 108 (which may include an operating system and a hypervisor in some embodiments, including all the software stacks associated with storage and networking functions for the computing system), the software storage stack typically includes modules enabling use of various protocols that can be used in storage, such as the small computer system interface (SCSI) protocol, the serial ATA (SATA) protocol, the non-volatile memory express (NVMe) protocol (a protocol for accessing direct-attached storage (DAS), like solid-state drives (SSDs), through the PCI Express (PCIe) bus 110 of a typical computing system 102) or the like. The PCIe bus 110 may provide an interconnection between a CPU 106 (with processor(s) and memory) and various IO cards. The storage stack also may include volume managers, etc. Operations within the storage software stack may also include data protection, such as mirroring or RAID, backup, snapshots, deduplication, compression and encryption. Some of the storage functions may be offloaded into a storage controller 112. The software network stack includes modules, functions and the like for enabling use of various networking protocols, such as Transmission Control Protocol/Internet Protocol (TCP/IP), the domain name system protocol (DNS), the address resolution protocol (ARP), forwarding protocols, and the like.
Some of the network functions may be offloaded into a network interface controller 118 (or NIC) or the network fabric switch, such as via an Ethernet connection 120, in turn leading to a network (with various switches, routers and the like). In virtualized environments, a NIC 118 may be virtualized into several virtual NICs as specified by SR-IOV under the PCI Express standard. Although not specified by the PCI Express standard and not as common, storage controllers can also be virtualized in a similar manner. This approach allows virtual entities, such as virtual machines, access to their own private resource.

Referring to FIG. 2, one major problem with hypervisors is the complexity of IO operations. For example, in order to deal with an operation involving data across two different computers (computer system 1 and computer system 2 in FIG. 2), data must be copied repeatedly as it moves among the different software stacks involved in local storage devices 104, storage controllers 112, the CPUs 106, network interface controller 118 and the hypervisor/operating systems 108 of the computers, resulting in large numbers of inefficient data copies for each IO operation whenever an activity is undertaken that involves moving data from one computer to another, changing the configuration of storage, or the like. The route 124 is one of many examples of the complex routes that data may take from one computer to another, moving up and down the software stacks of the two computers. Data that is sought by computing system 2 may be initially located in a local storage device 104, such as a disk, of computing system 1, then pulled by a storage controller card 112 (involving an IO operation and copying), sent over the PCIe bus 110 (another IO operation) to the CPU 106 where it is handled by a hypervisor or other software component of the OS stack 108 of computing system 1. Next, the data may be delivered (another IO operation) through the network controller 118 and over the network 122 (another set of IO operations) to computing system 2. The route continues on computing system 2, where data may travel through the network controller 118 and to the CPU 106 of computing system 2 (involving additional IO operations), then be sent over the PCIe bus 110 to the local storage controller 112 for storage, then back to the hypervisor/OS stack 108 for actual use. These operations may occur across a multiplicity of pairs of computing systems, with each exchange involving this kind of proliferation of IO operations (and many other routes are possible, each involving significant numbers of operations).
Many such complex data replication and transport activities among computing systems are required in scaleout situations, which are increasingly adopted by enterprises. For example, when implementing a scaleout application like MongoDB™, customers must repeatedly run real time queries during rebalancing operations, and perform large scale data loading. Such activities involve very large numbers of IO operations, which result in poor performance in hypervisor solutions. Users of those applications also frequently re-shard (change the shards on which data is deployed), resulting in big problems for bare metal solutions that have static storage resource allocations, as migration of data from one location to another also involves many copying and transport operations, with large numbers of IO operations. As the amount of data used in scaleout applications grows rapidly, and the connectedness among disparate systems increases (such as in cloud deployments involving many machines), these problems grow exponentially. A need exists for storage and networking solutions that reduce the number and complexity of IO operations and otherwise improve the performance and scalability of scaleout applications without requiring expensive, premium hardware.

Referring still to FIG. 2, for many applications and use cases, data (and in turn, storage) needs to be accessed across the network between computing systems 102. Three high-level steps of this operation include the transfer of data from the storage media of one computing system out of a box, movement across the network 122, and the transfer of data into a second box (second computing system 102) to the storage media 104 of that second computing system 102. First, the out-of-the-box transfer may involve intervention from the storage controller 112, the storage stack in the OS 108, the network stack in the OS 108, and the network interface controller 118. Many traversals and copies across internal busses (PCIe 110 and memory) as well as CPU 106 processing cycles are spent. This not only degrades performance (creating latency and throughput issues) of the operation, but also adversely affects other applications that run on the CPU. Second, once the data leaves the box 102 and moves onto the network 122, it is treated like any other network traffic and needs to be forwarded/routed to its destination. Policies are executed and decisions are made. In environments where a large amount of traffic is moving, congestion can occur in the network 122, causing degradation in performance as well as problems with availability (e.g., dropped packets, lost connections, and unpredictable latencies). Networks have mechanisms and algorithms to avoid spreading of congestion, such as pause functions, backward congestion notification (BCN), explicit congestion notification (ECN), etc. However, these are reactive methods; that is, they detect formation of congestion points and push back on the source to reduce congestion, potentially resulting in delays and performance impacts.
Third, once the data arrives at its “destination” computing system 102, it needs to be processed, which involves intervention from the network interface controller 118, the network stack in the OS 108, the storage stack in the OS 108, and the storage controller 112. As with the out-of-the-box operations noted above, many traversals and copies across internal busses as well as CPU 106 processing cycles are spent. Further, the final destination of the data may well reside in still a different box. This can be the result of a need for more data protection (e.g., mirroring or across-box RAID) or the need for de-duplication. If so, then the entire sequence of out-of-the-box, across-the-network, and into-the-box data transfer needs to be repeated. As described, limitations of this approach include degradation in raw performance, unpredictable performance, impact on other tenants or operations, reduced availability and reliability, and inefficient use of resources. A need exists for data transfer systems that avoid the complexity and performance impacts of the current approaches.

As an alternative to hypervisors (which provide a separate operating system for each virtual machine that they manage), technologies such as Linux™ containers have been developed (which enable a single operating system to manage multiple application containers). Also, tools such as Docker have been developed, which provide provisioning for packaging applications with libraries. Among many other innovations described throughout this disclosure, an opportunity exists for leveraging the capabilities of these emerging technologies to provide improved methods and systems for scaleout applications.

SUMMARY

Provided herein are methods and systems that include a converged storage and network controller in hardware that combines initiator, target storage functions and network functions into a single data and control path, which allows a “cut-through” path between the network and storage, without requiring intervention by a host CPU. For ease of reference, this is referred to variously throughout this disclosure as a converged hardware solution, a converged device, a converged adaptor, a converged IO controller, a “datawise” controller, or the like, and such terms should be understood to encompass, except where context indicates otherwise, a converged storage and network controller in hardware that combines target storage functions and network functions into a single data and control path.

Among other benefits, the converged solution will increase raw performance of a cluster of computing and/or storage resources; enforce service level agreements (SLAs) across the cluster and help guarantee predictable performance; provide a multi-tenant environment where a tenant will not affect its neighbor; provide a denser cluster with higher utilization of the hardware resulting in smaller data center footprint, less power, fewer systems to manage; provide a more scalable cluster; and pool storage resources across the cluster without loss of performance.

The various methods and systems disclosed herein provide high-density consolidation of resources required for scaleout applications and high performance multi-node pooling. These methods and systems provide a number of customer benefits, including dynamic cluster-wide resource provisioning, the ability to guarantee quality-of-service (QoS), security, isolation, etc. on network and storage functions, and the ability to use shared infrastructure for production and testing/development.

Also provided herein are methods and systems to perform storage functions through the network and to virtualize storage and network devices for high performance and deterministic performance in single or multi-tenant environments.

Also provided herein are methods and systems for virtualization of storage devices, such as those using NVMe and similar protocols, and the translation of those virtual devices to different physical devices, such as ones using SATA.

The methods and systems disclosed herein also include methods and systems for end-to-end congestion control involving only the hardware on the host (as opposed to the network fabric) that includes remote credit management and a distributed scheduling algorithm at the box level.
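
As a purely illustrative sketch of remote credit management, the following Python models a target that grants credits only against free buffer slots and an initiator that transmits only while it holds credits. The class names, the one-credit-per-request accounting, and the buffer model are assumptions made for illustration; they are not taken from this disclosure and omit the distributed scheduling aspect entirely.

```python
from collections import deque


class Target:
    """Hypothetical target that grants credits against its free buffer slots."""

    def __init__(self, buffer_slots):
        self.free_slots = buffer_slots

    def grant(self, requested):
        # Never grant more credits than there are free buffer slots.
        granted = min(requested, self.free_slots)
        self.free_slots -= granted
        return granted

    def complete(self, count):
        # Each serviced request frees the buffer slot it occupied.
        self.free_slots += count


class Initiator:
    """Hypothetical initiator that sends only while it holds credits."""

    def __init__(self, target):
        self.target = target
        self.credits = 0
        self.pending = deque()
        self.sent = []

    def submit(self, request):
        self.pending.append(request)

    def pump(self):
        # Ask for credits to cover queued requests that do not yet have one.
        shortfall = len(self.pending) - self.credits
        if shortfall > 0:
            self.credits += self.target.grant(shortfall)
        # Send one request per held credit; the rest wait at the source.
        while self.pending and self.credits > 0:
            self.sent.append(self.pending.popleft())
            self.credits -= 1
        return len(self.sent)
```

In this sketch, excess requests are held back at the initiator before they enter the network, which is the proactive counterpart to the reactive pause and notification mechanisms described in the background.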

Also provided herein are various methods and systems that are enabled by the converged network/storage controller, including methods and systems for virtualization of a storage cluster or of other elements that enable a cluster, such as a storage adaptor, a network adaptor, a container (e.g., a Linux container), a Solaris zone or the like. Among advantages, one aspect of virtualizing a cluster is that containers can become location-independent in the physical cluster. Among other advantages, this allows movement of containers among machines in a vastly simplified process described below.

Provided herein are methods and systems for virtualizing direct-attached storage (DAS), so that the operating system stack 108 still sees a local, persistent device, even if the physical storage is moved and is remotely located; that is, provided herein are methods and systems for virtualization of DAS. In embodiments this may include virtualizing DAS over a fabric, that is, taking a DAS storage system and moving it outside the box and putting it on the network. In embodiments this may include carving DAS into arbitrary name spaces. In embodiments the virtualized DAS is made accessible as if it were actual DAS to the operating system, such as being accessible by the OS 108 over a PCIe bus via NVMe. Thus, provided herein is the ability to virtualize storage (including DAS) so that the OS 108 sees it as DAS, even if the storage is actually accessed over a network protocol such as Ethernet, and the OS 108 is not required to do anything different than would be required with local physical storage.
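
The indirection described above can be sketched, under stated assumptions, as a table that maps the namespace identifiers seen by the OS onto backing extents that may be local or remote. The names `Extent` and `VirtualDas`, the block-addressed layout, and the host and device strings are hypothetical and serve only to illustrate that the OS-visible identifier stays fixed when the backing storage moves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Hypothetical backing location for one namespace."""
    host: str          # "local" or the name of a remote host
    device: str
    start_block: int
    num_blocks: int


class VirtualDas:
    """Maps namespace IDs presented to the OS onto backing extents."""

    def __init__(self):
        self._namespaces = {}

    def carve(self, nsid, extent):
        # Carve a region of physical storage into a namespace.
        self._namespaces[nsid] = extent

    def move(self, nsid, new_extent):
        # The backing storage moves; the namespace ID seen by the OS does not.
        if new_extent.num_blocks < self._namespaces[nsid].num_blocks:
            raise ValueError("new extent is smaller than the namespace")
        self._namespaces[nsid] = new_extent

    def resolve(self, nsid, lba):
        # Translate an OS-visible block address to its physical location.
        ext = self._namespaces[nsid]
        if not 0 <= lba < ext.num_blocks:
            raise IndexError("LBA outside namespace")
        return ext.host, ext.device, ext.start_block + lba
```

For example, a namespace carved from a local disk and later moved to a remote host resolves to a different physical location for the same namespace ID and block address, without any change on the OS side.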

Provided herein are methods and systems for providing DAS across a fabric, including exposing virtualized DAS to the OS 108 without requiring any modification of the OS 108.

Also provided herein are methods and systems for virtualization of a storage adaptor (referring to a target storage system).

Provided herein are methods and systems for combining storage initiation and storage targeting in a single hardware system. In embodiments, these may be attached by a PCIe bus 110. A single root virtualization function (SR-IOV) may be applied to take any standard device and have it act as if it is hundreds of such devices. Embodiments disclosed herein include using SR-IOV to give multiple virtual instances of a physical storage adaptor. SR-IOV is a PCIe standard that virtualizes I/O functions, and while it has been used for network interfaces, the methods and systems disclosed herein extend it to use for storage devices. Thus, provided herein is a virtual target storage system.

Embodiments may include a switch form factor or network interface controller, wherein the methods and systems disclosed herein may include a host agent (either in software or hardware). Embodiments may include breaking up virtualization between a front end and a back end.

Embodiments may include various points of deployment for a converged network and target storage controller. While some embodiments locate the converged device on a host computing system 102, in other cases the disk can be moved to another box (e.g., connected by Ethernet to a switch that switches among various boxes below). While a layer may be needed to virtualize, the storage can be separated, so that one can scale storage and computing resources separately. Also, one can then enable blade servers (i.e., stateless servers). Installations that would have formerly involved expensive blade servers attached to storage area networks (SANs) can instead attach to the switch. In embodiments this comprises a “rackscale” architecture where resources are disaggregated at the rack level.

Methods and systems disclosed herein include methods and systems for virtualizing various types of non-DAS storage as DAS in a converged networking/target storage appliance. In embodiments, one may virtualize whatever storage is desired as DAS, using various front end protocols to the storage systems while exposing storage as DAS to the OS stack 108.

Methods and systems disclosed herein include virtualization of a converged network/storage adaptor. From a traffic perspective, one may combine systems into one. Combining the storage and network adaptors, and adding in virtualization, gives significant advantages. Say there is a single host 102 with two PCIe buses 110. To route from the PCIe 110, one can use a system like RDMA to get to another machine/host 102. If one were to do this separately, one would have to configure the storage and the network RDMA system separately. One has to join each one and configure them at two different places. In the converged scenario, the whole step of setting up QoS, seeing that this is RDMA and that there is another fabric elsewhere is a zero touch process, because with combined storage and networking the two can be configured in a single step. That is, once one knows the storage, one doesn't need to set up the QoS on the network separately.

Methods and systems disclosed herein include virtualization and/or indirection of networking and storage functions, embodied in the hardware, optionally in a converged network adaptor/storage adaptor appliance. While virtualization is a level of indirection, protocol is another level of indirection. The methods and systems disclosed herein may convert a protocol suitable for use by most operating systems to deal with local storage, such as NVMe, to another protocol, such as SAS, SATA, or the like. One may expose a consistent interface to the OS 108, such as NVMe, and in the back end one may convert to whatever storage media is cost-effective. This gives a user a price/performance advantage. If components are cheaper/faster, one can connect any one of them. The back end could be anything, including NVMe.
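
A minimal sketch of this protocol-level indirection follows, assuming a simplified front-end command (NVMe-style read or write on a namespace) and a simplified sector-addressed back-end command. The types `FrontCommand` and `BackCommand`, the verb strings, and the `sectors_per_block` parameter are hypothetical; no real NVMe, SAS, or SATA opcodes or field layouts are modeled.

```python
from dataclasses import dataclass
from enum import Enum


class FrontOp(Enum):
    """Operations exposed to the OS through the consistent front end."""
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class FrontCommand:
    op: FrontOp
    namespace_id: int
    start_lba: int
    num_blocks: int


@dataclass(frozen=True)
class BackCommand:
    verb: str
    device: str
    start_sector: int
    sector_count: int


class Translator:
    """Converts front-end commands into commands for a back-end device."""

    # Illustrative verb names, not real protocol opcodes.
    _VERBS = {FrontOp.READ: "READ_SECTORS", FrontOp.WRITE: "WRITE_SECTORS"}

    def __init__(self, backend_for_namespace, sectors_per_block=1):
        self.backend_for_namespace = backend_for_namespace
        self.sectors_per_block = sectors_per_block

    def translate(self, cmd):
        # Pick the back-end device, then rescale block units to sectors.
        device = self.backend_for_namespace[cmd.namespace_id]
        scale = self.sectors_per_block
        return BackCommand(
            verb=self._VERBS[cmd.op],
            device=device,
            start_sector=cmd.start_lba * scale,
            sector_count=cmd.num_blocks * scale,
        )
```

Because the OS only ever issues `FrontCommand` values in this sketch, swapping the back-end device behind a namespace changes only the translator's table, not the interface the OS sees.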

Provided herein are methods and systems that include a converged data path for network and storage functions in an appliance. Alternative embodiments may provide a converged data path for network and storage functions in a switch.

In embodiments, methods and systems disclosed herein include storage/network tunneling, wherein the tunneling path between storage systems over a network does not involve the operating system of a source or target computer. In conventional systems, one had separate storage and network paths, so accessing storage remotely required extensive copying to and from memory, I/O buses, etc. Merging the two paths means that storage traffic is going straight onto the network. The OS 108 of each computer sees only a local disk. Another advantage is simplicity of programming. A user does not need to separately program a SAN, meaning that the methods disclosed herein include a one-step programmable SAN. Rather than requiring discovery and specification of zones, and the like, encryption, attachment, detachment and the like may be done centrally and programmatically.

Embodiments disclosed herein may include virtualizing the storage to the OS 108 so that the OS 108 sees storage as a local disk. The level of indirection involved in the methods and systems disclosed herein allows the converged system to hide not only the location, but the media type, of storage media. All the OS sees is that there is a local disk, even if the actual storage is located remotely and/or is of a different type, such as a SAN. Thus, virtualization of storage is provided, where the OS 108 and applications do not have to change. One can hide behind this virtualization all of the management, policies of tiering, policies of backup, policies of protection and the like that are normally needed to configure complex storage types.

Methods and systems are provided for selecting where indirection occurs in the virtualization of storage. Virtualization of certain functions may occur in hardware (e.g., in an adaptor on a host, in a switch, and in varying form factors (e.g., FPGA or ASICs)) and in software. Different topologies are available, such as where the methods and systems disclosed herein are deployed on a host machine, on a top of the rack switch, or in a combination thereof. Factors that go into the selection include ease of use. Users who want to run stateless servers may prefer a top of rack. Ones who don't care about that approach might prefer the controller on the host.

Methods and systems disclosed herein include providing NVMe over Ethernet. These approaches can be the basis for the tunneling protocol that is used between devices. NVMe is a suitable DAS protocol that is intended conventionally to go to a local PCIe. Embodiments disclosed herein may tunnel the NVMe protocol traffic over Ethernet. NVMe (non-volatile memory express) is a protocol that in Linux and Windows provides access to PCIe-based Flash Storage. This provides high performance by by-passing the software stacks used in conventional systems.

Embodiments disclosed herein may include providing an NVMe device that is virtualized and dynamically allocated. In embodiments one may piggyback NVMe, but carve up and virtualize and dynamically allocate an NVMe device. In embodiments there is no footprint in the software. The operating system stays the same (just a small driver that sees the converged network/storage card). This results in virtual storage presented like a direct attached disk, but the difference is that now we can pool such devices across the network.

Provided herein are methods and systems for providing the simplicity of direct attached storage (DAS) with the advantages of sharing like in a storage area network (SAN). Each converged appliance in various embodiments disclosed herein may be a host, and any storage drives may be local to a particular host but seen by the other hosts (as in a SAN or other network-accessible storage). The drives in each box enabled by a network/storage controller of the present disclosure behave like a SAN (that is, are available on the network), but the management methods are much simpler. When a storage administrator sets up a SAN, a typical enterprise may have a whole department setting up zones for a SAN (e.g., a Fibre Channel switch), such as setting up “who sees what.” That knowledge is pre-loaded and a user has to ask the SAN administrator to do the work to set it up. There is no programmability in a typical legacy SAN architecture. The methods and systems disclosed herein provide local units that are on the network, but the local units can still access their storage without having to go through complex management steps like zone definition, etc. These devices can do what a SAN does just by having both network and storage awareness. As such, they represent the first programmatic SAN.

Methods and systems disclosed herein may include persistent, stateful, disaggregated storage enabled by a hardware appliance that provides converged network and storage data management.

Methods and systems disclosed herein may also include convergence of network and storage data management in a single appliance, adapted to support use of containers for virtualization. Such methods and systems are compatible with the container ecosystem that is emerging, while offering certain additional advantages.

Methods and systems are disclosed herein for implementing virtualization of NVMe. Regardless of how many sources send to how many destinations, as long as the data from the sources is serialized first before going into the hub, the hub distributes the data to the designated destinations sequentially. If so, then data transport resources such as the DMA engine can be reduced to only one copy. This may include various use scenarios. In one scenario, for NVMe virtual functions (VFs), if they are all connected to the same PCIe bus, then regardless of how many VFs are configured, the data would be coming into this pool of VFs serially, so only one DMA engine and only one storage block (for control information) are needed. In another use scenario, for a disk storage system with a pool of discrete disks/controllers, if the data originates from the physical bus, i.e., PCIe, then since the data is serially coming into this pool of disks, regardless of how many disks/controllers are in the pool, the transport resources such as the DMA engine can be reduced to only one instead of one per controller.
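
The single-engine argument above can be illustrated with a small model in which a hub owns exactly one transfer engine and distributes an already serialized stream to any number of virtual functions. The `DmaEngine` and `Hub` classes are hypothetical stand-ins written for this illustration; they model only the resource count, not real DMA hardware or PCIe behavior.

```python
class DmaEngine:
    """Hypothetical transfer engine; counts the copies it performs."""

    def __init__(self):
        self.transfers = 0

    def copy(self, payload):
        self.transfers += 1
        return bytes(payload)


class Hub:
    """Distributes a serialized stream to VFs using a single engine."""

    def __init__(self, num_vfs):
        # One engine is allocated no matter how many VFs are configured.
        self.engine = DmaEngine()
        self.delivered = {vf: [] for vf in range(num_vfs)}

    def run(self, serialized_stream):
        # Items arrive one at a time, so one engine is never contended.
        for vf, payload in serialized_stream:
            self.delivered[vf].append(self.engine.copy(payload))
```

Because the input is serialized before it reaches the hub, at most one transfer is in flight at any moment in this model, which is why adding VFs increases only the size of the delivery table and not the number of engines.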

In accordance with various exemplary and non-limiting embodiments, a device comprises a converged input/output controller that includes a physical target storage media controller, a physical network interface controller, and a gateway between the storage media controller and the network interface controller, wherein the gateway provides a direct connection for storage traffic and network traffic between the storage media controller and the network interface controller.

In accordance with various exemplary and non-limiting embodiments, a method of virtualization of a storage device comprises accessing a physical storage device that responds to instructions in a first storage protocol, translating instructions between the first storage protocol and a second storage protocol and, using the second protocol, presenting the physical storage device to an operating system, such that the storage of the physical storage device can be dynamically provisioned, whether the physical storage device is local or remote to a host computing system that uses the operating system.

In accordance with various exemplary and non-limiting embodiments, a method of facilitating migration of at least one of an application and a container comprises providing a converged storage and networking controller, wherein a gateway provides a connection for network and storage traffic between a storage component and a networking component of the device without intervention of the operating system of a host computer and mapping the at least one application or container to a target physical storage device that is controlled by the converged storage and networking controller, such that the application or container can access the target physical storage, without intervention of the operating system of the host system to which the target physical storage is attached, when the application or container is moved to another computing system.

In accordance with various exemplary and non-limiting embodiments, a method of providing quality of service (QoS) for a network comprises providing a converged storage and networking controller, wherein a gateway provides a connection for network and storage traffic between a storage component and a networking component of the device without intervention of the operating system, a hypervisor, or other software running on the CPU of a host computer and, also without intervention of the operating system, hypervisor, or other software running on the CPU of a host computer, managing at least one quality of service (QoS) parameter related to a network in the data path of which the storage and networking controller is deployed, such managing being based on at least one of the storage traffic and the network traffic that is handled by the converged storage and networking controller.

QoS may be based on various parameters, such as one or more of a bandwidth parameter, a network latency parameter, an IO performance parameter, a throughput parameter, a storage type parameter and a storage latency parameter. QoS may be maintained automatically when at least one of an application and a container that is serviced by storage through the converged storage and network controller is migrated from a host computer to another computer. Similarly, QoS may be maintained automatically when at least one target storage device that services at least one of an application and a container through the converged storage and network controller is migrated from a first location to another location or multiple locations. For example, storage may be scaled, or different storage media types may be selected, to meet storage needs as requirements are increased. In embodiments, a security feature may be provided, such as encryption of network traffic data, encryption of data in storage, or both. Various storage features may be provided as well, such as compression, protection levels (e.g., RAID levels), use of different storage media types, global de-duplication, and snapshot intervals for achieving at least one of a recovery point objective (RPO) and a recovery time objective (RTO).
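
One way to picture QoS being maintained across migration is to key the policy on the application or container rather than on the host, so that a move changes only the placement record. The sketch below assumes this design; the `QosPolicy` fields are a hypothetical subset of the parameters listed above, and the units and names are illustrative rather than drawn from this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QosPolicy:
    """Hypothetical subset of the QoS parameters named above."""
    min_bandwidth_mbps: int
    max_network_latency_us: int
    max_storage_latency_us: int
    storage_type: str


class QosTable:
    """Keeps policy keyed by container so that it survives migration."""

    def __init__(self):
        self._policy = {}
        self._host = {}

    def attach(self, container, host, policy):
        self._policy[container] = policy
        self._host[container] = host

    def migrate(self, container, new_host):
        # Only the placement changes; the policy entry is left untouched.
        if container not in self._policy:
            raise KeyError(container)
        self._host[container] = new_host

    def lookup(self, container):
        return self._host[container], self._policy[container]
```

In this model no per-host reconfiguration step is needed after a move, since a lookup for the migrated container returns the new host together with the original policy.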

BRIEF DESCRIPTION OF THE FIGURES

The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the systems and methods disclosed herein.

FIG. 1 illustrates a general architecture in accordance with an exemplary and non-limiting embodiment.

FIG. 2 illustrates a computer system in accordance with an exemplary and non-limiting embodiment.

FIG. 3 illustrates a converged solution in accordance with an exemplary and non-limiting embodiment.

FIG. 4 illustrates two computing systems enabled by a converged solution in accordance with an exemplary and non-limiting embodiment.

FIG. 5 illustrates a converged controller in accordance with an exemplary and non-limiting embodiment.

FIG. 6 illustrates a deployment of a converged controller in accordance with an exemplary and non-limiting embodiment.

FIG. 7 illustrates a plurality of systems in accordance with an exemplary and non-limiting embodiment.

FIG. 8 illustrates a block diagram of a field-programmable gate array (FPGA) in accordance with an exemplary and non-limiting embodiment.

FIG. 9 illustrates an architecture of a controller card in accordance with an exemplary and non-limiting embodiment.

FIG. 10 illustrates a software stack in accordance with an exemplary and non-limiting embodiment.

FIGS. 11-15 illustrate the movement of an application container across multiple systems in accordance with an exemplary and non-limiting embodiment.

FIG. 16 illustrates packet transmission in accordance with an exemplary and non-limiting embodiment.

FIG. 17 illustrates a storage access scheme in accordance with an exemplary and non-limiting embodiment.

FIG. 18 illustrates the operation of a file system in accordance with an exemplary and non-limiting embodiment.

FIG. 19 illustrates the operation of a distributed file server in accordance with an exemplary and non-limiting embodiment.

FIG. 20 illustrates a high performance distributed file server (DFS) in accordance with an exemplary and non-limiting embodiment.

FIG. 21 illustrates a system in accordance with an exemplary and non-limiting embodiment.

FIG. 22 illustrates a host in accordance with an exemplary and non-limiting embodiment.

FIG. 23 illustrates an application accessing a block of data in accordance with an exemplary and non-limiting embodiment.

FIG. 24 illustrates an application accessing a block of data in accordance with an exemplary and non-limiting embodiment.

FIG. 25 illustrates a system in accordance with an exemplary and non-limiting embodiment.

FIG. 26 illustrates a method according to an exemplary and non-limiting embodiment.

FIG. 27 illustrates a method according to an exemplary and non-limiting embodiment.

FIG. 28 illustrates a method according to an exemplary and non-limiting embodiment.

Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the systems and methods disclosed herein.

DETAILED DESCRIPTION OF THE INVENTION

The present disclosure will now be described in detail by describing various illustrative, non-limiting embodiments thereof with reference to the accompanying drawings and exhibits. The disclosure may, however, be embodied in many different forms and should not be construed as being limited to the illustrative embodiments set forth herein. Rather, the embodiments are provided so that this disclosure will be thorough and will fully convey the concept of the disclosure to those skilled in the art. The claims should be consulted to ascertain the true scope of the disclosure.

Before describing in detail embodiments that are in accordance with the systems and methods disclosed herein, it should be observed that the embodiments reside primarily in combinations of method steps and/or system components related to converged networking and storage. Accordingly, the system components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the systems and methods disclosed herein so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art.

Referring to FIG. 3, the converged solution 300 may include three important aspects and may be implemented in a hardware device that includes a combination of hardware and software modules and functions. First, a cut-through data path 304 may be provided between a network controller 118 and a storage controller 112, so that access of the storage to and from the network can be direct, without requiring any intervention of the OS stack 108, the PCIe bus 110, or the CPU 106. Second, cut-through storage stack access, such as to storage devices 302, may be provided, such as access of the storage to and from entities on the local host, which allows bypassing of complex legacy software stacks for storage access, such as SCSI/SAS/SATA stacks. Third, end-to-end congestion management and flow control of the network may be provided, such as by a mechanism to reserve and schedule the transfer of data across the network, which guarantees the availability of the target's data to remote initiators and minimizes the congestion of the traffic as it flows through intermediate network fabric switches. The first and second aspects remove software stacks (hence the CPU 106 and memory) from the path of the data, eliminating redundant or unnecessary movement and processing. End-to-end congestion management and flow control delivers a deterministic and reliable transport of the data.

As noted above, one benefit of the converged solution 300 is that the operating system stack 108 connects to the converged solution 300 over a conventional PCIe 110 or a similar bus, so that the OS stack 108 sees the converged solution 300, and any storage that it controls through the cut-through to storage devices 302, as one or more local, persistent devices, even if the physical storage is remotely located. Among other things, this comprises the capability for virtualization of DAS 308, which may include virtualizing DAS 308 over a fabric, that is, taking a DAS 308 storage system and moving it outside the computing system 102 and putting it on the network. The storage controller 112 of the converged solution 300 may connect to and control DAS 308 on the network 122 via various known protocols, such as SAS, SATA, or NVMe. In embodiments, virtualization may include carving DAS 308 into arbitrary name spaces. In embodiments, the virtualized DAS 308 is made accessible as if it were actual, local, physical DAS to the operating system, such as being accessible by the OS 108 over a PCIe bus 110 to the storage controller 112 of the converged solution 300 via a standard protocol such as NVMe. Again, the OS 108 sees the entire solution 300 as a local, physical device, such as DAS. Thus, provided herein is the ability to virtualize storage (including DAS and other storage types, such as SAN 310) so that the OS 108 sees any storage type as DAS, even if the storage is actually accessed over a network 122, and the OS 108 is not required to do anything different than would be required with local physical storage.
In the case where the storage devices 302 are SAN 310 storage, the storage controller 112 of the converged solution may control the SAN 310 through an appropriate protocol used for storage area networks, such as the Internet Small Computer System Interface (iSCSI), Fibre Channel (FC), or Fibre Channel over Ethernet (FCoE). Thus, the converged solution 300 provides a translation for the OS stack 108 from any of the other protocols used in storage, such as Ethernet, SAS, SATA, NVMe, iSCSI, FC or FCoE, among others, to a simple protocol like NVMe that makes the disparate storage types and protocols appear as local storage accessible over PCIe 110. This translation in turn enables virtualization of a storage adaptor (referring to any kind of target storage system). Thus, methods and systems disclosed herein include methods and systems for virtualizing various types of non-DAS storage as DAS in a converged networking/target storage appliance 300. In embodiments, one may virtualize whatever storage is desired as DAS, using various protocols to the storage systems while exposing storage as DAS to the OS stack 108. Thus, provided herein are methods and systems for virtualization of storage devices, such as those using NVMe and similar protocols, and the translation of those virtual devices to different physical devices, such as ones using SATA.

Storage/network tunneling 304, where the tunneling path between storage systems over the network 122 does not involve the operating system of a source or target computer, enables a number of benefits. In conventional systems, one has separate storage and network paths, so accessing storage remotely requires extensive copying to and from memory, I/O buses, etc. Merging the two paths means that storage traffic goes straight onto the network. An advantage is simplicity of programming. A user does not need to separately program a SAN 310, meaning that the methods disclosed herein enable a one-step programmable SAN 310. Rather than requiring discovery and specification of zones, and the like, configuration, encryption, attachment, detachment and the like may be done centrally and programmatically. As an example, a typical SAN is composed of “initiators,” “targets,” and a switch fabric, which connects the initiators and targets. Typically, which initiators see which targets is defined and controlled by the fabric switches, through what are called “zones.” Therefore, if an initiator moves or a target moves, zones need to be updated. The second control portion of a SAN typically lies with the “targets.” They can control which initiator port can see which logical unit numbers (LUNs) (storage units exposed by the target). This is typically referred to as LUN masking and LUN mapping. Again, if an initiator moves locations, one has to re-program the “target.” Consider now that in such an environment, if an application moves from one host to another (such as due to a failover, load re-balancing, or the like), the zoning and LUN masking/mapping need to be updated. Alternatively, one could pre-program the SAN so that every initiator sees every target. However, doing so results in an unscalable and insecure SAN. In the alternate solution described throughout this disclosure, such a movement of an application, a container, or a storage device does not require any SAN re-programming, resulting in a zero touch solution.
The mapping maintained and executed by the converged solution 300 allows an application or a container, the target storage media, or both, to be moved (including to multiple locations) and scaled independently, without intervention by the OS, a hypervisor, or other software running on the host CPU.

The fact that the OS 108 sees storage as a local disk allows simplified virtualization of storage. The level of indirection involved in the methods and systems disclosed herein allows the converged system 300 to hide not only the location, but also the media type, of storage media. All the OS 108 sees is that there is a local disk, even if the actual storage is located remotely and/or is of a different type, such as a SAN 310. Thus, virtualization of storage is provided through the converged solution 300, where the OS 108 and applications do not have to change. One can hide behind the converged solution 300 all of the management, policies of tiering, policies of backup, policies of protection and the like that are normally needed to configure complex storage types.

The converged solution 300 enables the simplicity of direct attached storage (DAS) with the advantages of a storage area network (SAN). Each converged appliance 300 in various embodiments disclosed herein may act as a host, and any storage devices 302 may be local to a particular host but seen by the other hosts (as is the case in a SAN 310 or other network-accessible storage). The drives in each box enabled by a network/storage controller of the present disclosure behave like a SAN 310 (e.g., are available on the network), but the management methods are much simpler. Normally, when a SAN 310 is set up, a typical enterprise may have a whole department setting up zones for the SAN 310 (e.g., on a Fibre Channel switch), such as setting up “who sees what.” That knowledge must be pre-loaded, and a user has to ask the SAN 310 administrator to do the work to set it up. There is no programmability in a typical legacy SAN 310 architecture. The methods and systems disclosed herein provide local units that are on the network, but the local units can still access their storage without having to go through complex management steps like zone definition, etc. These devices can do what a SAN does just by having both network and storage awareness. As such, they represent the first programmatic SAN.

The solution 300 can be described as a “Converged IO Controller” that controls both the storage media 302 and the network 122. This converged controller 300 is not just a simple integration of the storage controller 112 and the network controller (NIC) 118. The actual functions of the storage and network are merged such that storage functions are performed as the data traverses to and from the network interface. The functions may be provided in a hardware solution, such as an FPGA (one or more) or ASIC (one or more) as detailed below.

Referring to FIG. 4, two or more computing systems 102 enabled by converged solutions 300 may serve as hosts for respective storage targets, where by merging storage and network and controlling both interfaces, direct access to the storage 302 can be achieved remotely over the network 122 without traversing internal busses or CPU/software work, such as by a point-to-point path 400 or by an Ethernet switch 402 to another computer system 102 that is enabled by a converged solution 300. The highest performance (high IOPs and low latency) can be achieved. Further, storage resources 302 can now be pooled across the cluster. In FIG. 4, this is conceptually illustrated by the dotted oval 404.

In embodiments, the converged solution 300 may be included on a host computing system 102, with the various components of a conventional computing system as depicted in FIG. 1, together with the converged IO controller 300 as described in connection with FIG. 3. Referring to FIG. 5, in alternative embodiments, the converged controller 300 may be disposed in a switch, such as a top of the rack switch, thus enabling a storage enabled switch 500. The switch may reside on the network 122 and be accessed by a network controller 118, such as of a conventional computing system 102.

Referring to FIG. 6, systems may be deployed in which a converged controller 300 is disposed both on one or more host computing systems 102 and on a storage enabled switch 500, which may be connected to systems 102 that are enabled by converged solutions 300 and to non-enabled systems 102. As noted above, target storage 302 for the converged controller(s) 300 on the host computing system 102 and on the storage enabled switch 500 can be visible to each other across the network, such as being treated as a unified resource, such as to virtualization solutions. In sum, intelligence, including handling converged network and storage traffic on the same device, can be located in a host system, in a switch, or both in various alternative embodiments of the present disclosure.

Embodiments disclosed herein may thus include a switch form factor or a network interface controller, or both, which may include a host agent (either in software or hardware). These varying deployments allow breaking up virtualization capabilities, such as on a host and/or on a switch and/or between a front end and a back end. While a layer may be needed to virtualize certain functions, the storage can be separated, so that one can scale storage and computing resources separately. Also, one can then enable blade servers (i.e., stateless servers). Installations that would have formerly involved expensive blade servers and attached storage area networks (SANs) can instead attach to the storage enabled switch 500. In embodiments this comprises a “rackscale” architecture, where resources are disaggregated at the rack level.

Methods and systems are provided for selecting where indirection occurs in the virtualization of storage. Virtualization of certain functions may occur in hardware (e.g., in a converged adaptor 300 on a host 102, in a storage enabled switch 500, in varying hardware form factors (e.g., FPGAs or ASICs)) and in software. Different topologies are available, such as where the methods and systems disclosed herein are deployed on a host machine 102, on a top of the rack switch 500, or in a combination thereof. Factors that go into the selection of where virtualization should occur include ease of use. Users who want to run stateless servers may prefer a top of rack storage enabled switch 500. Users for whom stateless servers are not a priority might prefer the converged controller 300 on the host 102.

FIG. 7 shows a more detailed view of a set of systems that are enabled with converged controllers 300, including two computer systems 102 (computer system 1 and computer system 2), as well as a storage enabled switch 500. Storage devices 302, such as DAS 308 and SAN 310, may be controlled by the converged controller 300 or the storage enabled switch 500. DAS 308 may be controlled in either case using SAS, SATA or NVMe protocols. SAN 310 may be controlled in either case using iSCSI, FC or FCoE. Connections among hosts 102 that have storage controllers 300 may be over a point-to-point path 400, over an Ethernet switch 402, or through a storage enabled switch 500, which also may provide a connection to a conventional computing system. As noted above, the multiple systems with intelligent converged controllers 300 can each serve as hosts and as storage target locations that the other hosts see, thereby providing the option to be treated as a single cluster of storage for purposes of an operating system 108 of a computing system 102.

Methods and systems disclosed herein include virtualization and/or indirection of networking and storage functions, embodied in the hardware converged controller 300, optionally in a converged network adaptor/storage adaptor appliance 300. While virtualization is a level of indirection, protocol is another level of indirection. The methods and systems disclosed herein may convert a protocol suitable for use by most operating systems to deal with local storage, such as NVMe, to another protocol, such as SAS, SATA, or the like. One may expose a consistent interface to the OS 108, such as NVMe, and on the other side of the converged controller 300 one may convert to whatever storage media 302 is cost-effective. This gives a user a price/performance advantage. If components are cheaper/faster, one can connect any one of them. The side of the converged controller 300 could face any kind of storage, including NVMe. Furthermore, the storage media type may be any of the following, including, but not limited to, HDD, SSD (based on SLC, MLC, or TLC Flash), RAM, etc., or a combination thereof.

In embodiments, a converged controller may be adapted to virtualize NVMe virtual functions, and to provide access to remote storage devices 302, such as ones connected to a storage-enabled switch 500, via NVMe over an Ethernet switch 402. Thus, the converged solution 300 enables the use of NVMe over Ethernet 700, or NVMeoE. Thus, methods and systems disclosed herein include providing NVMe over Ethernet. These approaches can be the basis for the tunneling protocol that is used between devices, such as the host computing system 102 enabled by a converged controller 300 and/or a storage enabled switch 500. NVMe is a suitable DAS protocol that is intended conventionally to go to a local PCIe 110. Embodiments disclosed herein may tunnel the NVMe protocol traffic over Ethernet. NVMe (non-volatile memory express) is a protocol that in Linux and Windows provides access to PCIe-based Flash. This provides high performance by bypassing the software stacks used in conventional systems, while avoiding the need to translate between NVMe (as used by the OS stack 108) and the traffic tunneled over Ethernet to other devices.

FIG. 8 is a block diagram of an FPGA 800, which may reside on an IO controller card and enable an embodiment of a converged solution 300. Note that while a single FPGA 800 is depicted, the various functional blocks could be organized into multiple FPGAs, into one or more custom Application Specific Integrated Circuits (ASICs), or the like. For example, various networking blocks and various storage blocks could be handled in separate (but interconnected) FPGAs or ASICs. References throughout this disclosure to an FPGA 800 should be understood, except where context indicates otherwise, to encompass these other forms of hardware that can enable the functional capabilities reflected in FIG. 8 and similar functions. Also, certain functional groups, such as for networking functions and/or storage functions, could be embodied in merchant silicon.

The embodiment of the FPGA 800 of FIG. 8 has four main interfaces. First, there is a PCIe interface, such as to the PCIe bus 110 of a host computer 102. Thus, the card is a PCIe end point. Second, there is a DRAM/NVRAM interface. For example, a DDR interface may be provided to external DRAM or NVRAM, used by the embedded CPUs, meta-data and data structures, and packet/data buffering. Third, there is a storage interface to media, such as DAS 308 and SAN 310. Storage interfaces can include ones for SAS, SATA, NVMe, iSCSI, FC and/or FCoE, and could in embodiments be any interface to rotating media, flash, or other persistent form of storage, either local or over a cut-through to a network-enabled storage like SAN 310. Fourth, a network interface is provided, such as Ethernet to a network fabric. The storage interfaces and the network interfaces can be used, in part, to enable NVMe over Ethernet.

The internal functions of the FPGA 800 may include a number of enabling features for the converged solution 300 and other aspects of the present disclosure noted throughout. A set of virtual endpoints (vNVMe) 802 may be provided for the host. Analogous to the SR-IOV protocol that is used for the network interface, this presents virtual storage targets to the host. In this embodiment of the FPGA 800, NVMe has benefits of low software overhead, which in turn provides high performance. A virtual NVMe device 802 can be dynamically allocated, de-allocated, moved and resized. As with SR-IOV, there is one physical function (PF) 806 that interfaces with a PCIe driver 110 (see below), and multiple virtual functions (VF) 807, each of which appears as an NVMe device.

Also provided in the FPGA 800 functions are one or more read and write direct memory access (DMA) queues 804, referred to in some cases herein as a DMA engine 804. These may include interrupt queues, doorbells, and other standard functions to perform DMA to and from the host computing system 102.

A device mapping facility 808 on the FPGA 800 may determine the location of the virtual NVMe devices 802. The location options would be local (i.e., attached to one of the storage media interfaces 824 shown), or remote on another host 102 of a storage controller 300. Access to a remote vNVMe device requires going through a tunnel 828 to the network 122.
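By way of non-limiting illustration, the local-versus-remote decision made by a device mapping facility might be sketched in software as follows. The class names, the host identifiers, and the returned route strings are hypothetical and are not taken from this disclosure; in embodiments the lookup would be performed in hardware.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Location:
    """Hypothetical record of where a virtual NVMe device physically lives."""
    host: str        # identifier of the host or switch that owns the media
    media_port: int  # index of the storage media interface on that host


class DeviceMap:
    """Hypothetical lookup table from vNVMe device id to physical location."""

    def __init__(self, local_host: str) -> None:
        self.local_host = local_host
        self._table: Dict[int, Location] = {}

    def bind(self, vnvme_id: int, location: Location) -> None:
        # Rebinding an existing id models moving the backing storage
        # without the operating system noticing any change.
        self._table[vnvme_id] = location

    def route(self, vnvme_id: int) -> str:
        loc = self._table[vnvme_id]
        if loc.host == self.local_host:
            # Local media: go straight to the storage media interface.
            return f"local:port{loc.media_port}"
        # Remote media: the request must go through the network tunnel.
        return f"tunnel:{loc.host}:port{loc.media_port}"


dmap = DeviceMap(local_host="host-1")
dmap.bind(0, Location("host-1", 2))
dmap.bind(1, Location("host-2", 0))
```

In this sketch, moving the media behind virtual device 0 to another host is a single `bind` call; the virtual device id seen by the host does not change.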

An NVMe virtualization facility 810 may translate NVMe protocol instructions and operations to the corresponding protocol and operations of the backend storage media 302, such as SAS or SATA where DAS 308 is used (in the case of use of NVMe on the backend storage medium 302, no translation may be needed), or such as iSCSI, FC or FCoE in the case where SAN 310 storage is used in the backend. References to the backend here refer to the other side of the converged controller 300 from the host 102.
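As one hedged example of such a translation, an NVMe read or write command could be mapped to a SCSI 10-byte READ or WRITE command descriptor block for a SAS backend. The sketch below assumes a simplified command record (`NvmeRw`, a hypothetical name) carrying only the opcode, starting block address, and the zero-based block count used by NVMe; a complete translation would handle many more fields and command types.

```python
import struct
from dataclasses import dataclass

NVME_WRITE, NVME_READ = 0x01, 0x02        # NVMe I/O command opcodes
SCSI_READ_10, SCSI_WRITE_10 = 0x28, 0x2A  # SCSI 10-byte CDB opcodes


@dataclass(frozen=True)
class NvmeRw:
    """Simplified, hypothetical view of an NVMe read/write command."""
    opcode: int
    slba: int  # starting logical block address
    nlb: int   # number of logical blocks, zero-based (0 means one block)


def to_scsi_cdb(cmd: NvmeRw) -> bytes:
    """Build a READ(10)/WRITE(10) CDB equivalent to the NVMe command."""
    opcodes = {NVME_READ: SCSI_READ_10, NVME_WRITE: SCSI_WRITE_10}
    if cmd.opcode not in opcodes:
        raise ValueError("unsupported NVMe opcode in this sketch")
    blocks = cmd.nlb + 1  # SCSI transfer length is a plain block count
    if cmd.slba > 0xFFFFFFFF or blocks > 0xFFFF:
        raise ValueError("request does not fit a 10-byte CDB")
    # Layout: opcode, flags, 32-bit LBA, group number, 16-bit length, control.
    return struct.pack("!BBIBHB", opcodes[cmd.opcode], 0, cmd.slba, 0, blocks, 0)


cdb = to_scsi_cdb(NvmeRw(NVME_READ, 0x1000, 7))
```

The off-by-one between the zero-based NVMe block count and the SCSI transfer length is the kind of detail the translation layer has to absorb so that the host sees only NVMe semantics.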

A data transformation function 812 may format the data as it is stored onto the storage media 302. These operations could include re-writes, transformation, compression, protection (such as RAID), encryption and other functions that involve changing the format of the data in any way as necessary to allow it to be handled by the applicable type of target storage medium 308. In some embodiments, storage medium 308 may be remote.

In embodiments, storage read and write queues 814 may include data structures or buffering for staging data during a transfer. In embodiments, temporary memory, such as DRAM or NVRAM (which may be located off the FPGA 800), may be used for temporary storage of data.

A local storage scheduler and shaper 818 may prioritize and control access to the storage media 302. Any applicable SLA policies for local storage may be enforced in the scheduler and shaper 818, which may include strict priorities, weighted round robin scheduling, IOP shapers, and policers, which may apply on a per queue, per initiator, per target, or per c-group basis, and the like.
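For illustration only, weighted round robin scheduling across per-initiator queues can be sketched as below. The queue names and weights are hypothetical, and a hardware scheduler would operate on descriptors rather than Python strings.

```python
from collections import deque
from typing import Deque, Dict, List


def weighted_round_robin(queues: Dict[str, Deque[str]],
                         weights: Dict[str, int]) -> List[str]:
    """Drain all queues, serving up to weights[name] requests per visit."""
    order: List[str] = []
    while any(queues.values()):  # an empty deque is falsy
        for name, q in queues.items():
            for _ in range(weights[name]):
                if not q:
                    break
                order.append(q.popleft())
    return order


queues = {
    "gold": deque(["g1", "g2", "g3"]),
    "bronze": deque(["b1", "b2"]),
}
served = weighted_round_robin(queues, {"gold": 2, "bronze": 1})
```

With a weight of 2 for the first queue and 1 for the second, the first queue receives two service opportunities for every one given to the second, which is one simple way an SLA ratio could be expressed.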

A data placement facility 820 may implement an algorithm that determines how the data is laid out on the storage media 302. That may involve various placement schemes known to those of skill in the art, such as striping across the media, localizing to a single device 302, using a subset of the devices 302, or localizing to particular blocks on a device 302.
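As a minimal sketch of one such placement scheme, striping maps a logical block address to a device and a block address on that device. The function name and the parameters `stripe_blocks` and `num_devices` are hypothetical; the arithmetic is the conventional RAID-0 style layout and is offered as an example rather than as the placement algorithm of any particular embodiment.

```python
from typing import Tuple


def stripe_location(lba: int, stripe_blocks: int, num_devices: int) -> Tuple[int, int]:
    """Return (device index, block address on that device) for a logical block."""
    stripe_index = lba // stripe_blocks        # which stripe unit the block is in
    device = stripe_index % num_devices        # stripe units rotate across devices
    row = stripe_index // num_devices          # how many full rows precede it
    device_lba = row * stripe_blocks + lba % stripe_blocks
    return device, device_lba
```

For example, with 4-block stripe units across 3 devices, logical blocks 0 to 3 land on device 0, blocks 4 to 7 on device 1, blocks 8 to 11 on device 2, and block 12 wraps back to device 0 at device block 4.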

A storage metadata management facility 822 may include data structures for data placement, block and object i-nodes, compression, deduplication, and protection. Metadata may be stored either in off-FPGA 800 NVRAM/DRAM or in the storage media 302.

A plurality of control blocks 824 may provide the interface to the storage media. These may include SAS, SATA, NVMe, PCIe, iSCSI, FC and/or FCoE, among other possible control blocks, in each case as needed for the appropriate type of target storage media 302.

A storage network tunnel 828 of the FPGA 800 may provide the tunneling/cut-through capabilities described throughout this disclosure in connection with the converged solution 300. Among other things, the tunnel 828 provides the gateway between storage traffic and network traffic. It includes encapsulation/de-encapsulation of the storage traffic, rewrite and formatting of the data, and end-to-end coordination of the transfer of data. The coordination may be between FPGAs 800 across nodes within a host computing system 102 or in more than one computing system 102, such as for the point-to-point path 400 described in connection with FIG. 4. Various functions, such as sequence numbers, packet loss, time-outs, and retransmissions may be performed. Tunneling may occur over Ethernet, including by FCoE or NVMeoE.
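The encapsulation and sequence-number handling described above might be sketched as follows. The 6-byte header layout (a 32-bit sequence number followed by a 16-bit payload length) is purely an assumption for illustration and is not the wire format of any embodiment; retransmission timers are omitted.

```python
import struct
from typing import Optional, Tuple

# Hypothetical tunnel header: 32-bit sequence number, 16-bit payload length.
HEADER = struct.Struct("!IH")


def encapsulate(seq: int, command: bytes) -> bytes:
    """Wrap a storage command in the hypothetical tunnel header."""
    return HEADER.pack(seq, len(command)) + command


def decapsulate(frame: bytes) -> Tuple[int, bytes]:
    """Recover the sequence number and storage command from a frame."""
    seq, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return seq, payload


class TunnelReceiver:
    """Accepts frames strictly in order; a gap signals a lost packet."""

    def __init__(self) -> None:
        self.expected = 0

    def accept(self, frame: bytes) -> Optional[bytes]:
        seq, payload = decapsulate(frame)
        if seq != self.expected:
            # Out-of-order frame: drop it and rely on the sender to
            # retransmit after a time-out (not modeled here).
            return None
        self.expected += 1
        return payload


rx = TunnelReceiver()
```

A receiver that sees sequence number 2 while expecting 1 returns nothing, which stands in for the packet-loss detection that would trigger a retransmission in a fuller implementation.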

A virtual network interface card facility 830 may include a plurality of SR-IOV endpoints to the host 102, presented as virtual network interface cards. One physical function (PF) 836 may interface with a PCIe driver 110 (see software description below), and multiple virtual functions (VF) 837 may be provided, each of which appears as a network interface card (NIC) 118.

A set of receive/transmit DMA queues 832 may include interrupt queues, doorbells, and other standard functions to perform DMA to and from the host 102.

A classifier and flow management facility 834 may perform standard network traffic classification, typically to IEEE standard 802.1Q class of service (COS) mappings or other priority levels.

An access control and rewrite facility 838 may handle access control lists (ACLs) and rewrite policies, including access control lists typically operating on Ethernet tuples (MAC SA/DA, IP SA/DA, TCP ports, etc.) to reclassify or rewrite packets.
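A first-match access control list that also rewrites the class of service can be illustrated with the sketch below. The packet fields, the rule structure, the implicit-deny default, and the example addresses and port are all assumptions chosen for the example; a hardware facility would match on more tuple fields and typically use ternary lookup rather than a linear scan.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    dst_port: int
    cos: int = 0  # class of service carried with the packet


@dataclass(frozen=True)
class Rule:
    src_ip: Optional[str]    # None acts as a wildcard
    dst_ip: Optional[str]
    dst_port: Optional[int]
    action: str              # "permit" or "deny"
    set_cos: Optional[int] = None  # optional rewrite applied on permit

    def matches(self, p: Packet) -> bool:
        pairs = ((self.src_ip, p.src_ip),
                 (self.dst_ip, p.dst_ip),
                 (self.dst_port, p.dst_port))
        return all(want is None or want == got for want, got in pairs)


def apply_acl(rules: List[Rule], pkt: Packet) -> Optional[Packet]:
    """Return the (possibly rewritten) packet, or None if it is dropped."""
    for rule in rules:
        if rule.matches(pkt):
            if rule.action == "deny":
                return None
            if rule.set_cos is not None:
                return replace(pkt, cos=rule.set_cos)
            return pkt
    return None  # implicit deny when no rule matches (an assumption)


rules = [
    Rule(None, "10.0.0.5", 3260, "permit", set_cos=4),  # example storage flow
    Rule("10.0.0.9", None, None, "deny"),
    Rule(None, None, None, "permit"),
]
```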

A forwarding function 840 may determine the destination of the packet, such as through layer 2 (L2) or layer 3 (L3) mechanisms.

A set of network receive and transmit queues 842 may handle data structures or buffering to the network interface. Off-FPGA 800 DRAM may be used for packet data.

A network/remote storage scheduler and policer 844 may provide priorities and control access to the network interface. SLA policies for remote storage and network traffic may be enforced here, which may include strict priorities, weighted round robin, IOP and bandwidth shapers, and policers on a per queue, per initiator, per target, per c-group, or per network flow basis, and the like.
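One common way to realize an IOP or bandwidth policer is a token bucket, sketched below. The token bucket is named here as a well-known technique and is an assumption; the disclosure does not specify which policing algorithm is used. Time is passed in explicitly so that the behavior is deterministic.

```python
class TokenBucket:
    """Token-bucket policer: `rate` tokens per second, at most `burst` stored."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full so an initial burst is allowed
        self.last = 0.0      # time of the previous decision, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # a policer drops or marks; a shaper would queue instead


bucket = TokenBucket(rate=2.0, burst=2.0)
```

One bucket could be kept per queue, per initiator, per target, or per flow; charging `cost` in operations gives an IOP limit, while charging it in bytes gives a bandwidth limit.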

A local network switch 848 may forward packets between queues in the FPGA, so that traffic does not need to exit the FPGA 800 to the network fabric 122 if the destination is local to the FPGA 800 or the host 102.

An end-to-end congestion control/credit facility 850 may prevent network congestion. This is accomplished with two algorithms. First, there may be an end-to-end reservation/credit mechanism with a remote FPGA 800. This may be analogous to a SCSI transfer ready function, where the remote FPGA 800 permits the storage transfer if it can immediately accept the data. Similarly, the local FPGA 800 allocates credits to remote FPGAs 800 as they request a transfer. SLA policies for remote storage may also be enforced here. Second, there may be a distributed scheduling algorithm, such as an iterative round-robin algorithm, such as the iSLIP algorithm for input-queued switches proposed in the publication “The iSLIP Scheduling Algorithm for Input-Queued Switches”, by Nick McKeown, IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 2, APRIL 1999. The algorithm may be performed cluster wide using the intermediate network fabric as the crossbar.
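The reservation/credit mechanism can be illustrated with a minimal target-side credit pool. The sketch assumes that one credit corresponds to one buffer the target can accept immediately, and that partial grants are permitted; both are assumptions for illustration, as are the class and method names.

```python
from typing import Dict


class CreditTarget:
    """Target-side credit pool: one credit per buffer it can accept at once."""

    def __init__(self, buffers: int) -> None:
        self.free = buffers
        self.granted: Dict[str, int] = {}

    def request(self, initiator: str, wanted: int) -> int:
        """Grant as many credits as are free, never more than requested."""
        grant = min(wanted, self.free)
        self.free -= grant
        self.granted[initiator] = self.granted.get(initiator, 0) + grant
        return grant

    def complete(self, initiator: str, count: int) -> None:
        """Return credits once the corresponding data has been consumed."""
        if count > self.granted.get(initiator, 0):
            raise ValueError("initiator is returning credits it never held")
        self.granted[initiator] -= count
        self.free += count


target = CreditTarget(buffers=4)
```

Because an initiator sends only what it holds credits for, data placed on the network always has a buffer waiting at the target, which is how such a scheme avoids congesting the intermediate fabric. An SLA policy could be layered on top by capping the grant per initiator.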

A rewrite, tag, and CRC facility 852 may encapsulate/de-encapsulate the packet with the appropriate tags and CRC protection.
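A simplified illustration of tagging a payload and protecting it with a CRC is given below. The 16-bit tag field and the trailer position are hypothetical, and the CRC-32 from Python's `zlib` module is used as a stand-in; the actual tag formats and CRC placement of an embodiment are not specified here.

```python
import struct
import zlib
from typing import Tuple


def add_tag_and_crc(payload: bytes, tag: int) -> bytes:
    """Prefix a 16-bit tag and append a CRC-32 computed over tag + payload."""
    body = struct.pack("!H", tag) + payload
    return body + struct.pack("!I", zlib.crc32(body))


def strip_tag_and_crc(frame: bytes) -> Tuple[int, bytes]:
    """Verify the CRC, then return the tag and the original payload."""
    body = frame[:-4]
    (crc,) = struct.unpack("!I", frame[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch")
    (tag,) = struct.unpack("!H", body[:2])
    return tag, body[2:]


frame = add_tag_and_crc(b"storage data", tag=100)
```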

A set of interfaces 854, such as MAC interfaces, may provide an interface to Ethernet.

A set of embedded CPU and cache complexes 858 may implement a process control plan, exception handling, and other communication to and from the local host and network remote FPGAs 800.

A memory controller 860, such as a DDR controller, may act as a controller for the external DRAM/NVRAM.

As a result of the integration of functions provided by the converged solution 300, as embodied in one example by the FPGA 800, provided herein are methods and systems for combining storage initiation and storage targeting in a single hardware system. In embodiments, these may be attached by a PCIe bus 110. A single root I/O virtualization (SR-IOV) function or the like may be applied to take any standard device (e.g., any storage media 302 device) and have it act as if it is hundreds of such devices. Embodiments disclosed herein include using a protocol like SR-IOV to give multiple virtual instances of a physical storage adaptor. SR-IOV is a PCIe standard that virtualizes I/O functions, and while it has been used for network interfaces, the methods and systems disclosed herein extend it to use for storage devices. Thus, provided herein is a virtualized target storage system. In embodiments the virtual target storage system may handle disparate media as if the media are a disk or disks, such as DAS 308.

Enabled by embodiments like the FPGA 800, embodiments of the methods and systems disclosed herein may also include providing an NVMe device that is virtualized and dynamically allocated. In embodiments one may piggyback the normal NVMe protocol, but carve up, virtualize and dynamically allocate the NVMe device. In embodiments there is no footprint in the software. The operating system 108 stays the same or nearly the same (possibly having a small driver that sees the converged network/storage card 300). This results in virtual storage that looks like a direct attached disk, but the difference is that now we can pool such storage devices 302 across the network 122.

Methods and systems are disclosed herein for implementing virtualization of NVMe. Regardless of how many sources are related to how many destinations, as long as the data from the sources is serialized before going into the hub, the hub distributes the data to the designated destinations sequentially. If so, then data transport resources such as the DMA queues 804, 832 can be reduced to only one copy. This may include various use scenarios. In one scenario, for NVMe virtual functions (VFs), if they are all connected to the same PCIe bus 110, then regardless of how many VFs 807 are configured, the data would be coming into this pool of VFs 807 serially, so only one DMA engine 804 and only one storage block (for control information) are needed.

In another use scenario, for a disk storage system with a pool of discrete disks/controllers, if the data originates from the physical bus, i.e., PCIe 110, then since the data comes into this pool of disks serially, regardless of how many disks/controllers are in the pool, the transport resources such as the DMA engine 804 can be reduced to only one instead of one per controller.

Methods and systems disclosed herein may also include virtualization of a converged network/storage adaptor 300. From a traffic perspective, one may combine systems into one. Combining the storage and network adaptors, and adding in virtualization, gives significant advantages. Say there is a single host 102 with two PCIe buses 110. To route from the PCIe 110, you can use a system like remote direct memory access (RDMA) to get to another machine/host 102. If one were to do this separately, one has to configure the storage and the network RDMA systems separately. One has to join each one and configure them at two different places. In the converged solution 300, the whole step of setting up QoS, seeing that this is RDMA and that there is another fabric elsewhere is a zero touch process, because with combined storage and networking the two can be configured in a single step. That is, once one knows the storage, one doesn't need to set up the QoS on the network separately. Thus, single-step configuration of network and storage for RDMA solutions is enabled by the converged solution 300.

Referring again to FIG. 4, remote access is enabled by the FPGA 800 or similar hardware as described in connection with FIG. 8. The virtualization boundary is indicated in FIG. 4 by the dotted line 408. To the left of this line, virtual storage devices (e.g., NVMe 802) and virtual network interfaces 830 are presented to the operating system 108. The operating system cannot tell these are virtual devices. To the right of the virtualization boundary 408 are physical storage devices 302 (e.g., using SATA or other protocols noted above) and physical network interfaces. Storage virtualization functions are implemented by the vNVMe 802 and the NVMe virtualization facility 810 of FIG. 8. Network virtualization functions are implemented by the vNIC facility 830. Location of the physical storage media is also hidden from the operating system 108. Effectively, the physical disks 302 across servers can be pooled and accessed remotely. The operating system 108 issues a read or write transaction to the storage media 302 (it is a virtual device, but the operating system 108 sees it as a physical device). If the physical storage media 302 happens to be remote, the read/write transaction is mapped to the proper physical location, encapsulated, and tunneled through Ethernet. This process may be implemented by the device mapping facility 808, the NVMe virtualization facility 810, the data transformation facility 812 and the storage-network tunnel 828 of FIG. 8. The target server (second computing system) un-tunnels the storage read/write and directly accesses its local storage media 302. If the transaction is a write, the data is written to the media 302. If the transaction is a read, the data is prepared, mapped to the origin server, encapsulated, and tunneled through Ethernet. The transaction completion arrives at the origin operating system 108. In a conventional system, these steps would require software intervention in order to process the storage request, data formatting, and network access. As shown, all of these complex software steps are avoided.
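The map, encapsulate and un-tunnel steps described above can be pictured with a minimal Python sketch. It is written in software purely for illustration, whereas the disclosure performs these steps in hardware. The extent table, the header layout and all names (`Extent`, `map_block`, `encapsulate`, `decapsulate`) are assumptions made for this example and are not taken from the disclosure.

```python
import struct
from dataclasses import dataclass

# Hypothetical tunnel header: opcode (1 byte), target device id (2 bytes),
# physical LBA (8 bytes), payload length (4 bytes), in network byte order.
HEADER = struct.Struct("!BHQI")
OP_READ, OP_WRITE = 0, 1


@dataclass(frozen=True)
class Extent:
    """Maps a range of virtual blocks to a (host, device, physical LBA) location."""
    virt_start: int
    length: int
    host: str        # "local" or the address of a remote server (assumed form)
    device: int
    phys_start: int


def map_block(extents, virt_lba):
    """Return (host, device, physical LBA) for a virtual block address."""
    for e in extents:
        if e.virt_start <= virt_lba < e.virt_start + e.length:
            return e.host, e.device, e.phys_start + (virt_lba - e.virt_start)
    raise KeyError(f"virtual LBA {virt_lba} is not mapped")


def encapsulate(opcode, device, phys_lba, payload=b""):
    """Origin side: wrap a storage command so it can be carried over Ethernet."""
    return HEADER.pack(opcode, device, phys_lba, len(payload)) + payload


def decapsulate(frame):
    """Target side: recover the storage command from the tunneled frame."""
    opcode, device, phys_lba, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    return opcode, device, phys_lba, payload
```

In this sketch the origin side would call `map_block` to find where a virtual block lives and, if the host is not local, send the result of `encapsulate` to that host; the target side would call `decapsulate` and access its local media directly.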

Referring to FIG. 9, a simplified block diagram is provided of an architecture of a controller card 902, as one embodiment of a converged solution 300 as described throughout this disclosure. The controller card 902 may be, for example, a standard, full-height, half-length PCIe card, such as a Gen3 ×16 card. However, a non-standard card size is acceptable, preferably sized so that it can fit into various types of targeted chassis. The PCIe form factor limits the stack up and layers used on the PCB.

The controller card 902 may be used as an add-on card on a commodity chassis, such as a 2RU, 4 node chassis. Each node of the chassis (called a sled) is typically 1RU and 6.76″ wide. The motherboard typically may provide a PCIe Gen3 ×16 connector near the back. A riser card may be used to allow the controller card 902 to be installed on top of the motherboard; thus, the clearance between the card and the motherboard may be limited to roughly one slot width.

In embodiments, the maximum power supplied by the PCIe connector is 75 W. The controller card 902 may consume about 60 W or less.

The chassis may provide good airflow, but the card should expect a 10°C rise in ambient temperature, because in this example the air will be warmed by dual Xeon processors and 16 DIMMs. The maximum ambient temperature for most servers is 35°C, so the air temperature at the controller card 902 will likely be 45°C or higher in some situations. Custom heat sinks and baffles may be considered as part of the thermal solution.

There are two FPGAs in the embodiment of the controller card 902 depicted in FIG. 9, a datapath FPGA, or datapath chip 904, and a networking FPGA, or networking chip 908.

The datapath chip 904 provides connectivity to the host computer 102 over the PCIe connector 110. From the host processor's point of view, the controller card 902 looks like multiple NVMe devices. The datapath chip 904 bridges NVMe to standard SATA/SAS protocol and in this embodiment controls up to six external disk drives over SATA/SAS links. Note that SATA supports up to 6.0 Gbps, while SAS supports up to 12.0 Gbps.

The networking chip 908 switches the two 10G Ethernet ports of the NIC device 118 and the eCPU 1018 to two external 10G Ethernet ports. It also contains a large number of data structures for use in virtualization.

The motherboard of the host 102 typically provides a PCIe Gen3 ×16 interface that can be divided into two separate PCIe Gen3 ×8 busses in the Intel chipset. One of the PCIe Gen3 ×8 busses 110 is connected to the Intel NIC device 118. The second PCIe Gen3 ×8 bus 110 is connected to a PLX PCIe switch chip 1010. The downstream ports of the switch chip 1010 are configured as two PCIe Gen3 ×8 busses 110. One of the busses 110 is connected to the eCPU while the second is connected to the datapath chip 904.

The datapath chip 904 uses external memory for data storage. A single x72 DDR3 channel 1012 should provide sufficient bandwidth for most situations. The networking chip 908 also uses external memory for data storage, and a single x72 DDR3 channel is likely to be sufficient for most situations. In addition, the data structures require the use of non-volatile memory that provides high performance and sufficient density, such as a non-volatile DIMM (NVDIMM), which typically has a built-in power switching circuit and super-capacitors as energy storage elements for data retention.

The eCPU 1018 communicates with the networking chip 908 using two sets of interfaces. It has a PCIe Gen2 ×4 interface for NVMe-like communication. The eCPU 1018 also has two 10G Ethernet interfaces that connect to the networking chip 908, such as through its L2 switch.

An AXI bus 1020 (a bus specification of the ARM chipset) will be used throughout the internal design of the two chips 904, 908. To allow seamless communication between the datapath chip 904 and the networking chip 908, the AXI bus 1020 is used for chip-to-chip connection. The Xilinx Aurora™ protocol, a serial interface, may be used as the physical layer.

The key requirements for FPGA configuration are that (1) the datapath chip 904 must be ready before PCIe configuration starts (QSPI Flash memory (serial flash memory with a quad SPI bus interface) may be fast enough) and (2) the chips are preferably field upgradeable. The Flash memory for configuration is preferably large enough to store at least 3 copies of the configuration bitstream. The bitstream refers to the configuration memory pattern used by Xilinx™ FPGAs. The bitstream is typically stored in non-volatile memory and is used to configure the FPGA during initial power-on. The eCPU 1018 may be provided with a facility to read and write the configuration Flash memories. New bitstreams may reside with the processor of the host 102. Security and authentication may be handled by the eCPU 1018 before attempting to upgrade the Flash memories.

In a networking subsystem, the Controller card 902 may handle all network traffic between the host processor and the outside world. The Networking chip 908 may intercept all network traffic, both from the NIC 118 and from external sources.

The Intel NIC 118 in this embodiment connects two 10GigE XFI interfaces 1022 to the Networking chip 908. The embedded processor will do the same. The Networking chip 908 will perform an L2 switching function and route Ethernet traffic out to the two external 10GigE ports. Similarly, incoming 10GigE traffic will be directed to either the NIC 118, the eCPU 1018, or internal logic of the Networking chip 908.

The controller card 902 may use SFP+ optical connectors for the two external 10G Ethernet ports. In other embodiments, the card may support 10GBASE-T using an external PHY and RJ45 connectors; but a separate card may be needed, or a custom paddle card arrangement may be needed to allow switching between SFP+ and RJ45.

All the management of the external port and optics, including the operation of the LEDs, may be controlled by the Networking chip 908. Thus, signals such as PRST, I2C/MDIO, etc. may be connected to the Networking chip 908 instead of the NIC 118.

In a storage subsystem, the Datapath chip 904 may drive the mini-SAS HD connectors directly. In embodiments such as depicted in FIG. 10, the signals may be designed to operate at 12 Gbps to support the latest SAS standard.

To provide efficient use of board space, two ×4 mini-SAS HD connectors may be used. All eight sets of signals may be connected to the Datapath chip 904, even though only six sets of signals might be used at any one time.

On the chassis, high-speed copper cables may be used to connect the mini-SAS HD connectors to the motherboard. The placement of the mini-SAS HD connectors may take into account the various chassis' physical space and routing of the cables.

The power to the controller card 902 may be supplied by the PCIe ×16 connector. No external power connection needs to be used. Per PCIe specification, the PCIe ×16 connector may supply only up to 25 W of power after power up. The controller card 902 may be designed such that it draws less than 25 W until after PCIe configuration. Thus, a number of interfaces and components may need to be held in reset after initial power up. The connector may supply up to 75 W of power after configuration, which may be arranged such that the 75 W is split between the 3.3V and 12V rails.

FIG. 10 shows a software stack 1000, which includes a driver 1002 to interface to the converged solution 300, such as one enabled by the FPGA 800. The NVMe controller 1004 is the set of functions of the hardware (e.g., FPGA 800) that serves the function of an NVMe controller and allocates virtual devices 1012 to the host. In FIG. 10, dev1, dev2, dev3 are examples of virtual devices 1012 which are dynamically allocated to containers 1018 LXC1, LXC2, and LXC3, respectively. The NVMe to SATA bridge 1008 is the part of the hardware sub-system (e.g., FPGA 800) that converts and maps virtual devices 1012 (dev1, dev2, dev3) to storage devices 302 (e.g., SSDs in the figure). The connection 1010 is the part of the hardware system that provides a SATA connection (among other possible connection options noted above). The Ethernet link 120 can expose virtual devices 1012 (i.e., dev1, dev2, dev3) to other host(s) 102 connected via the Ethernet link 120 using a storage tunneling protocol. The PCI-E (NVMe driver) 1002 may program and drive the hardware subsystem for the storage side. This driver 1002 may run on the host as part of the operating system (e.g., Linux OS in this example). The block layer 1014 may be a conventional SCSI sub-system of the Linux operating system, which may interface with the converged solution PCIe driver 1002 to expose virtual storage devices 1012. The containers 1018 (LXC1, LXC2, LXC3) may request and dynamically be allocated virtual storage devices 1012 (dev1, dev2 and dev3, respectively).

FIGS. 11 through 15 show an example of the movement of an application container 1018 (e.g., a Linux container) across multiple systems 102, first in the absence of a converged solution 300 and then in the presence of such a converged solution 300. FIG. 11 shows an example of two conventional computer systems 102 with conventional storage controllers 112 and network controllers 118 hosting virtualized software in an OS/Hypervisor stack 108. Computer System 1 (C1) has a configuration similar to the one shown in FIG. 1 with CPU, memory and conventional storage controller 112 and network controller 118. The system runs an operating system 108, such as Linux™, Microsoft Windows™, etc., and/or hypervisor software, such as Xen, VMware, etc., to provide support for multiple applications natively or over virtualized environments, such as via virtual machines or containers. In this computer system 102, application App1 1102 is running inside a virtual machine VM1 1104. Applications App2 1108 and App3 1112 are running within virtualized containers LXC1 1110 and LXC2 1114, respectively. In addition to these, application App4 1118 is running natively over the Operating System 108. Although a practical scenario might typically have only virtual machines or containers or native applications (not all three), here they are deliberately depicted in a combined fashion to cover all cases of virtualized environments. Computer System 2 (C2) 102 has a similar configuration supporting App5 and App6 in a container and natively, respectively. Each of these applications accesses its storage devices 302 independently of the others, namely App1 uses S1, App2 uses S2, etc. These storage devices 302 (designated S1-S6) are not limited to being independent physical entities. They could be logically carved out of one or more physical storage elements as deemed necessary. As one can see (represented by the arrow from each storage device 302 to an application), the data flow between the storage 302 and the application 1102, 1108, 1112, 1118 passes through the storage controller 112 and the operating system/hypervisor stack 108 before it reaches the application, entailing the challenges described in connection with FIG. 1.

Referring to FIG. 12, when an application or a container is moved from C1 to C2, its corresponding storage device needs to be moved too. The movement could be needed due to the fact that C1 might be running out of resources (such as CPU, memory, etc.) to support the existing applications (App1-App4) over a period of time, such as because of behavioral changes within these applications.

Typically, it is easier to accomplish the movement within a reasonable amount of time as long as the application states and the storage are reasonable in terms of size. Storage-intense applications, however, may use large amounts (e.g., multiple terabytes) of storage, in which case it may not be practical to move the storage 302 within an acceptable amount of time. In that case, storage may continue to stay where it was and software-level shunting/tunneling would be undertaken to access the storage remotely, as shown in FIG. 13.

As shown in FIG. 13, after its movement to C2, App2 1108 continues to access its original storage S2 located in C1 by traversing through Operating Systems or Hypervisors 108 of both the systems C1 and C2. This is because the mapping of storage access through the network controllers 118 from C2 to C1 and over to that storage controller 112 of C1 is done by the Operating System or Hypervisor software 108 running within the main CPU of each computer system.

Consider a similar scenario when a converged controller 300 is applied as shown in FIG. 14. As one can see, the scenario is almost identical to FIG. 11, except the Converged IO Controller 300 replaces the separate storage controller 112 and network controller 118. In this case, when App2 1108 along with its container LXC1 is moved to C2 (as shown in FIG. 15), the storage S2 is not moved, and the access is optimized by avoiding the traversal through any software (Operating System, Hypervisor 108 or any other) running in the main CPU present in computing system C1.

Thus, provided herein is a novel way of bypassing the main CPU where a storage device is located, which in turn (a) allows one to reduce latency and improve bandwidth significantly in accessing storage across multiple computer systems and (b) vastly simplifies and improves situations in which an application needs to be moved away from a machine on which its storage is located.

Referring to FIG. 16, Ethernet networks behave on a best effort basis and hence tend to be lossy in nature, as well as bursty. Any packet could be lost forever or buffered and delivered in a delayed manner, in a congestion-inducing burst, along with many other packets. Typical storage-centric applications are sensitive to losses and bursts, so it is important that, when storage traffic is sent over buses and Ethernet networks, those transports involve reliable and predictable methods for maintaining integrity. For example, Fibre Channel networks conventionally employ credit-based flow control to limit the number of accesses made by end systems at any one time. The number of credits given to an end system can be based on whether the storage device 302 has enough command buffers to receive and fulfill storage requests in a predictable amount of time that satisfies required latency and bandwidth requirements. FIG. 16 shows some of the credit schemes adopted by different types of buses such as a SATA bus 1602, Fibre Channel (FC) 1604, and SCSI/SAS connection 1608, among other types of such schemes.

As one can see, for example, an FC controller 1610 may have its own buffering up to a limit of ‘N’ storage commands before sending them to an FC-based storage device 1612, but the FC device 1612 might have a different buffer limit, say ‘M’ in this example, which could be greater than, equal to, or less than ‘N’. A typical credit-based scheme uses target-level aggregate credits (in this example, one of the storage devices 302, such as the FC device 1612, is the target), information about which is propagated to various sources (in this example, the controller, such as the FC controller 1610, is the source) which are trying to access the target 302. For example, if two sources are accessing a target that has a queue depth of ‘N,’ then the sum of the credits given to the sources would not exceed ‘N,’ so that at any given time the target will not receive more than ‘N’ commands. The distribution of credits among the sources may be arbitrary, or it may be based on various types of policies (e.g., priorities based on cost/pricing, SLAs, or the like). When the queue is serviced, by fulfilling the command requests, credits may be replenished at the sources as appropriate. By adhering to this kind of credit-based storage access, losses that would result from queues at the target being overwhelmed can be avoided.
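The target-level aggregate credit scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `TargetCredits`, the weight-based distribution policy and the method names are hypothetical and are not part of any Fibre Channel API or of the disclosure.

```python
class TargetCredits:
    """Target-level aggregate credits shared among several sources."""

    def __init__(self, queue_depth, weights):
        # weights is a hypothetical policy input, e.g. {"src_a": 3, "src_b": 1},
        # standing in for priorities based on pricing, SLAs, or the like.
        total = sum(weights.values())
        self.queue_depth = queue_depth
        # Floor division guarantees the granted credits never sum above N.
        self.limit = {s: queue_depth * w // total for s, w in weights.items()}
        self.available = dict(self.limit)

    def try_send(self, source):
        """Consume one credit; return False if the source must wait."""
        if self.available[source] == 0:
            return False
        self.available[source] -= 1
        return True

    def complete(self, source):
        """The target fulfilled a command: replenish one credit at the source."""
        if self.available[source] < self.limit[source]:
            self.available[source] += 1

    def outstanding(self):
        """Commands currently queued at the target across all sources."""
        return sum(self.limit[s] - self.available[s] for s in self.limit)
```

Because every source stops sending when its credits reach zero, `outstanding()` can never exceed the target's queue depth, which is the property the credit scheme relies on to avoid overwhelming the target.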

Typical storage accesses over Ethernet, such as FCoE, iSCSI, and the like, may extend the target-oriented, credit-based command fulfillment to transfers over Ethernet links. In such cases, they may be target device-oriented, rather than being source-oriented. Provided herein are new credit-based schemes that can instead be based on which or what kind of source should get how many credits. For example, the converged solution 300 described above, which directly interfaces the network to the storage, may employ a multiplexer to map a source-oriented, credit-based scheduling scheme to a target device-oriented, credit-based scheme, as shown in FIG. 17.

As shown in FIG. 17, four sources are located over Ethernet and there are two target storage devices 302. Typical target-oriented, credit-based schemes would expose two queues (one per target), or two connections per source (one to each of the targets). Instead, as shown in FIG. 17, the queues (Q1, Q2, Q3, Q4) 1702 are on a per-source basis, and they are mapped/multiplexed to two target-oriented queues (Q5, Q6) 1704 across the multiplexer (S) 1708. By employing this type of source-oriented, credit-based scheme, one may guarantee access bandwidth and predictable access latency, independent of the number of target storage devices 302. As an example, one type of multiplexing is to make sure that the queue size ‘P’ of Q1 does not exceed the combined size ‘L+M’ of Q5 and Q6, so that Q1 is not overwhelmed by its source.
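As a rough illustration of the source-oriented scheme, the following Python sketch keeps one queue per source and multiplexes those queues onto per-target queues, enforcing the sizing rule that the per-source depth ‘P’ does not exceed the aggregate target depth ‘L+M’. The class name `Multiplexer`, the round-robin policy and the method names are assumptions made for this example; the disclosure does not specify a particular scheduling policy.

```python
from collections import deque


class Multiplexer:
    """Maps per-source queues (like Q1..Q4) onto per-target queues (like Q5, Q6)."""

    def __init__(self, source_depth, target_depths):
        # Sizing rule from the text: P must not exceed L + M.
        if source_depth > sum(target_depths.values()):
            raise ValueError("source queue depth exceeds aggregate target depth")
        self.source_depth = source_depth
        self.target_depths = target_depths
        self.sources = {}                               # source id -> deque of (target, cmd)
        self.targets = {t: deque() for t in target_depths}

    def submit(self, source, target, cmd):
        """Accept a command only while the source still has room (credits)."""
        q = self.sources.setdefault(source, deque())
        if len(q) >= self.source_depth:
            return False
        q.append((target, cmd))
        return True

    def schedule(self):
        """One round-robin pass: move at most one command per source."""
        moved = 0
        for q in self.sources.values():
            if not q:
                continue
            target, cmd = q[0]
            tq = self.targets[target]
            if len(tq) < self.target_depths[target]:
                tq.append(cmd)
                q.popleft()
                moved += 1
        return moved
```

Each source sees only its own queue depth, whatever the number of targets behind the multiplexer. A command at the head of a source queue waits if its target queue is full, so a real design would likely need a policy for that head-of-line case; this sketch simply retries on the next pass.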

In embodiments, methods and systems to provide access to blocks of data from a storage device 302 are described. In particular, a novel approach to allowing an application to access its data, fulfilling a specific set of access requirements, is described.

As used herein, the term “application-driven data storage” (ADS) encompasses storage that provides transparency to any application in terms of how the application's data is stored, accessed, transferred, cached and delivered to the application. ADS may allow applications to control these individual phases to address the specific needs of the particular application. As an example, an application might be comprised of multiple instances of itself, as well as multiple processes spread across multiple Linux nodes across the network. These processes might access multiple files in shared or exclusive manners among them. Based on how the application wants to handle these files, these processes may want to access different portions of the files more frequently, may need quick accesses, or may use data once and throw it away. Based on these criteria, the application might want to prefetch and/or retain specific portions of a file in different tiers of cache and/or storage for immediate access on a per-session or per-file basis as it wishes. These application-specific requirements cannot be fulfilled in a generic manner such as disk striping of an entire file system, prefetching of read-ahead sequential blocks, reserving physical memory in the server, or LRU or FIFO based caching of file contents.

Application-driven data storage I/O is not simply applicable to the storage entities alone. It impacts the entire storage stack in several ways. First, it impacts the storage I/O stack in the computing node where the application is running, comprising the Linux paging system, buffering, underlying file system client, TCP/IP stack, classification, QoS treatment and packet queuing provided by the networking hardware and software. Second, it impacts the networking infrastructure that interconnects the application node and its storage, comprising Ethernet segments, optimal path selections, buffering in switches, classification and QoS treatment of latency-sensitive storage traffic, as well as implosion issues related to storage I/O. Third, it impacts the storage infrastructure which stores and maintains the data in terms of files, comprising the underlying file layout, redundancy, access time, and tiering between various types of storage as well as remote repositories.

Methods and systems disclosed herein include ones relating to the elements affecting a typical application within an application node and how a converged solution 300 may change the status quo to address certain critical requirements of applications.

Conventional Linux stacks may consist of simple building blocks, such as generic memory allocation, process scheduling, file access, memory mapping, page caching, etc. Although these are essential for an application to run on Linux, they are not optimal for certain categories of applications that are input/output (IO) intensive, such as NoSQL. NoSQL applications are very IO intensive, and it is harder to predict their data access in a generic manner. If applications have to be deployed in a utility-computing environment, it is not ideal for Linux to provide generic minimal implementations of these building blocks. It is preferred for these building blocks to be highly flexible and have application-relevant features that are controllable from the application(s).

Although every application has its own specific requirements, in an exemplary embodiment, the NoSQL class of applications has the following requirements which, when addressed by the Linux stack, could greatly improve the performance of NoSQL applications and other IO intensive applications. The first requirement is file-level priority. The Linux file system should, at a minimum, provide access-level priority between different files. An example is an application process (consisting of multiple threads) accessing two different files, with one file given higher priority over the other (such as one database/table/index over the other). This would enable the precious storage I/O resources to be preferentially utilized based on the data being accessed. One could argue that this could be indirectly addressed by running one thread/process at a higher or lower priority, but those process-level priorities are not communicated to the file system or storage components. Process or thread level priorities are meant only for utilizing CPU resources. Moreover, it is possible that the same thread might be accessing these two files and hence would be utilizing the storage resources at two different levels based on what data (file) is being accessed. Second, there may be a requirement for access-level preferences. A Linux file system may provide various preferences (primarily SLA) during a session of a file (opened file), such as priority between file sessions, the amount of buffering of blocks, the retention/lifetime preferences for various blocks, alerts for resource thresholds and contentions, and performance statistics. As an example, when a NoSQL application such as MongoDB or Cassandra has two or more threads for writes and reads, and writes have to be given preference over reads, a file session for write may have to be given preference over a file session for read for the same file. This capability enables two sessions of the same file to have two different priorities.
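To make the idea of per-session priority concrete, the following Python sketch orders pending I/O requests by the priority of the file session that issued them, rather than by the priority of the issuing thread. The class `SessionIOScheduler` and its methods are hypothetical names chosen for illustration; they do not correspond to any existing Linux interface.

```python
import heapq
import itertools


class SessionIOScheduler:
    """Orders pending I/O by the priority of the file session that issued it."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker keeps FIFO order within a priority
        self._priority = {}             # session id -> priority (lower number served first)

    def open_session(self, session_id, priority):
        """Register a session (an opened file) with its own priority."""
        self._priority[session_id] = priority

    def submit(self, session_id, request):
        """Queue a request under the priority of its session."""
        entry = (self._priority[session_id], next(self._seq), session_id, request)
        heapq.heappush(self._heap, entry)

    def next_request(self):
        """Return the (session id, request) pair that should be serviced next."""
        _, _, session_id, request = heapq.heappop(self._heap)
        return session_id, request
```

With a write session opened at priority 1 and a read session on the same file opened at priority 5, queued writes are dispatched before queued reads even when a single thread issued both, which is the behavior that thread-level priorities alone cannot express.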

Many of the NoSQL applications store different types of data in the same file; for example, MongoDB stores user collections as well as (b-tree) index collections in the same set of database files. MongoDB may want to keep the index pages (b-tree and collections) in memory in preference over user collection pages. When these files are opened, MongoDB may want to influence the Linux, file and storage systems to treat the pages according to MongoDB policies, as opposed to treating these pages on a generic FIFO or LRU basis agnostic of the application's requirements.
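One way to picture an application-influenced retention policy is a cache that evicts pages of the least preferred class first and only falls back to LRU order within a class. The Python sketch below is an assumption-laden illustration: the class `ClassAwareCache`, the page-class names and the ranking scheme are invented for this example and do not describe MongoDB or Linux internals.

```python
from collections import OrderedDict


class ClassAwareCache:
    """Evicts from the lowest-ranked page class first, LRU within a class."""

    def __init__(self, capacity, class_rank):
        self.capacity = capacity
        # Hypothetical application policy, e.g. {"index": 2, "user": 1};
        # a higher rank means the class is retained longer.
        self.class_rank = class_rank
        self.pages = {c: OrderedDict() for c in class_rank}

    def __len__(self):
        return sum(len(p) for p in self.pages.values())

    def get(self, page_class, page_id):
        """Return cached data (refreshing its recency) or None on a miss."""
        pages = self.pages[page_class]
        if page_id in pages:
            pages.move_to_end(page_id)
            return pages[page_id]
        return None

    def put(self, page_class, page_id, data):
        """Insert or update a page, evicting by class rank if over capacity."""
        pages = self.pages[page_class]
        if page_id in pages:
            pages.move_to_end(page_id)
        pages[page_id] = data
        while len(self) > self.capacity:
            self._evict()

    def _evict(self):
        # Walk classes from least to most preferred and drop the oldest page found.
        for c in sorted(self.class_rank, key=self.class_rank.get):
            if self.pages[c]:
                return self.pages[c].popitem(last=False)
```

With index pages ranked above user pages, inserting a new index page into a full cache pushes out a user page, even if that user page was touched more recently than the other index pages.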

Resource alerts and performance statistics enable a NoSQL database to understand the behavior of the underlying file and storage system, so that it can service its database queries accordingly or trigger actions such as sharding of the database or reducing/increasing the file I/O preference for other jobs running in the same host (such as backup, sharding, number of read/write queries serviced, etc.). For example, performance statistics about the minimum, maximum and average number of IOPs and latencies, as well as the top ten candidate pages thrashed in and out of host memory in a given period of time, would enable an application to fine-tune itself by dynamically adjusting the parameters noted above.
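The kind of per-period summary described above could be computed along the following lines. This is a small Python sketch only; the function name `summarize_io` and the shape of its inputs (lists of samples collected by a file system client) are assumptions for illustration.

```python
from collections import Counter
from statistics import mean


def summarize_io(iops_samples, latencies_ms, thrashed_page_ids, top_n=10):
    """Summarize one reporting period of (hypothetical) file system client counters."""
    return {
        "iops": {
            "min": min(iops_samples),
            "max": max(iops_samples),
            "avg": mean(iops_samples),
        },
        "latency_ms": {
            "min": min(latencies_ms),
            "max": max(latencies_ms),
            "avg": mean(latencies_ms),
        },
        # Pages that were evicted and re-read most often during the period.
        "top_thrashed": Counter(thrashed_page_ids).most_common(top_n),
    }
```

An application could poll such a summary and, for instance, raise the retention preference of the pages listed under `top_thrashed`.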

A requirement may also exist for caching and tiering preferences. A Linux file system may need to have a dynamically configurable caching policy while applications are accessing their files. Currently, Linux file systems typically pre-fetch contiguous blocks of a file, hoping that applications are reading the file in a sequential manner like a stream. Although this is true for many legacy applications like web servers and video streamers, emerging NoSQL applications do not follow sequential reads strictly. These applications read blocks randomly. As an example, MongoDB stores the document keys in b-tree index tables, laid out flat on a portion of a file; when a key is searched, MongoDB accesses the blocks randomly until it locates the key. Moreover, these files are not dedicated to such b-tree based index tables alone. These files are shared among various types of tables (collections) such as user documents and system index files. Because of this, a Linux file system cannot predict what portions of the file need to be cached, read ahead, swapped out for efficient memory usage, etc.

In embodiments of the methods and systems described herein, there is a common thread across various applications in their requirements for storage. In particular, latency and IOPs for specific types of data at specific times and places of need are very impactful on performance of these applications.

For example, to address the host level requirements listed above, disclosed herein are methods and systems for a finely tuned file-system client that enables applications to completely influence and control the storing, retrieving, retaining and tiering of data according to preference within the host and elsewhere.

As shown in FIG. 18, a File System (FS) client 1802 keeps separate buffer pools for separate sessions of a file (fd1 and fd2). It also pre-allocates and maintains aggregate memory pools for each application or set of processes. The SLA-Broker 1804 may be exercised by the application either internally within the process/thread where the file I/O is carried out or externally from another set of processes, to influence the FS Client 1802 to provide appropriate storage I/O SLAs dynamically. Controlling the SLA from an external process enables a legacy application with no knowledge of these newer storage control features to benefit from them immediately, without modifying the application itself.

Methods and systems disclosed herein may provide extensive tiering services for data retrieval across network and hosts. As one can see in FIG. 19, a High Performance Distributed File Server (DFS) 1902 enables applications to run in the Platform 1904 in a containerized form to determine and execute what portions of files should reside in which media (DRAM, NVRAM, SSD or HDDs), in cached form or storage form, dynamically. These application containers 1908 can also determine other storage policies, such as whether a file has to be striped, mirrored, RAIDed and disaster recovered (DR'ed).

The methods and systems disclosed herein also provide an extensive caching service, wherein an application container in the High Performance DFS 1902 can proactively retrieve specific pages of a file from local storage and/or remote locations and push these pages to specific places for fast retrieval later when needed. For instance, the methods and systems may monitor local memory and SSD usage of the hosts running the application and proactively push pages of interest to the application into any of these hosts' local memory/SSD. The methods and systems may use the local tiers of memory, SSD and HDD provisioned for this purpose in the DFS platform 1904 for very low latency retrieval by the application at a later time of its need.

The usefulness of extending the cache across hosts of the applications is immense. For example, in MongoDB, when the working set temporarily grows beyond its local host's memory, thrashing happens, and it significantly reduces the query handling performance. This is because, when a needed file data page is discarded in order to bring in a new page to satisfy a query and the original page subsequently has to be brought back, the system has to reread the page afresh from the disk subsystem, thereby incurring huge latency in completing a query. Application-driven storage access helps in these kinds of scenarios by keeping a cache of the discarded page elsewhere in the network (in another application host's memory/SSD or in local tiers of storage of the High Performance DFS system 1902) temporarily until MongoDB requires the page again, thereby significantly reducing the latency in completing the query.
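The effect of parking a discarded page elsewhere in the network can be sketched as a two-level cache in Python. Everything here is illustrative: the class `TieredPageCache` is a hypothetical name, the `remote` dictionary merely stands in for another host's memory/SSD or a DFS tier, and `read_from_disk` is a caller-supplied function representing the slow path.

```python
from collections import OrderedDict


class TieredPageCache:
    """Local LRU cache whose evicted pages are parked in a remote tier."""

    def __init__(self, local_capacity, read_from_disk):
        self.local_capacity = local_capacity
        self.read_from_disk = read_from_disk   # slow path, supplied by the caller
        self.local = OrderedDict()             # page id -> data, in LRU order
        self.remote = {}                       # stand-in for a remote memory/SSD tier

    def read(self, page_id):
        """Return (data, tier) where tier names where the page was found."""
        if page_id in self.local:
            self.local.move_to_end(page_id)
            return self.local[page_id], "local"
        if page_id in self.remote:
            data, tier = self.remote.pop(page_id), "remote"
        else:
            data, tier = self.read_from_disk(page_id), "disk"
        self.local[page_id] = data
        if len(self.local) > self.local_capacity:
            # Instead of erasing the victim, keep it in the remote tier.
            victim_id, victim = self.local.popitem(last=False)
            self.remote[victim_id] = victim
        return data, tier
```

When the working set briefly exceeds `local_capacity`, a page that was pushed out is served from the remote tier on its next access rather than being reread from the disk subsystem.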

Referring to FIG. 20, High Performance DFS 1902 takes advantage of DRAM and SSD resources across the application hosts in a single, unified RAM and SSD-based tier/cache 2002, in order to cache and serve the application data as necessary and as influenced and controlled by the application.

A system comprising a set of hosts (H1 through HN), a file or block server 2102 and a storage subsystem 2104 is disclosed herein, as shown in FIG. 21. A host H1-HN is typically a computer running an application that needs access to data permanently or temporarily stored in storage. The file or volume server 2102 may be a data organizer and a data server, typically running on hardware comprising a central processing unit (CPU), memory and special hardware to connect to external devices such as networking and storage devices. The file or volume server 2102 organizes user data into units of a fixed or variable number of bytes called blocks. It stores these blocks of data in internal or external storage. A random, but logically related, sequence of blocks is organized into a file or a volume. One or more hosts H1-HN can access these files or volumes through an application programming interface (API) or any other protocol. A file or volume server can serve one or more files and volumes to one or more hosts. It is to be noted that a host and a file or volume server can be two different physical entities connected directly or through a network, or they could be logically located together in a single physical computer.

Storage 2104 may be a collection of entities capable of retaining a piece of data temporarily or permanently. It typically comprises static or dynamic random access memory (RAM), solid state storage (SSD), hard disk drives (HDD) or a combination of these. Storage could be an independent physical entity connected to a file or volume server 2102 through a link or a network. It could also be integrated with file or volume server 2102 in a single physical entity. Hence, hosts H1-HN, file or volume server 2102 and storage 2104 could be physically collocated in a single hardware entity.

A host typically comprises multiple logical entities, as shown in FIG. 22. An application 2202 typically runs in a host and accesses its data elements through an API provided by its local operating system 2204 or any other entity in place of it. The operating system 2204 typically has a standard API to interface to a file system through its file system client 2206. A file system client 2206 is a software entity running within the host to interface with a file or volume server 2210 located either remotely or locally. It provides the data elements needed by application 2202, which are present in a single or multiple files or volumes, by retrieving them from file or volume server 2210 and keeping them in the host's memory 2208 until the application completes its processing of the elements of data. In a typical application scenario, a specific piece of data would be read and/or modified multiple times as required. It is also typical that an entire file or volume, consisting of multiple data elements, is potentially much larger than the size of local memory 2208 in certain types of applications. This makes the implementation of operating system 2204 and file system client 2206 more complicated, because they must decide which blocks of data to retain in or evict from memory 2208 based on a prediction of whether the application 2202 will access them in the future. So far, existing implementations execute generic and application-independent methods, such as first-in-first-out (FIFO) or least-recently-used (LRU), to retain or evict the blocks of data in memory in order to bring in new blocks of data from file or volume server 2210. Moreover, when memory occupied by a block of data is to be reclaimed for storing another block of data, the original data is simply erased without consideration for its future use. Normally, the disk subsystem is very slow and incurs high latency when a block of data is read from it and transferred by file or volume server 2210 to file system client 2206 and on to memory 2208. So, when the original block of data is erased, the application might have to wait longer if it tries to access the original data in the near future. The main problem with this kind of implementation is that none of the modules in the path of data access, namely operating system 2204, file system client 2206, memory 2208, block server 2210 and storage, has any knowledge of what, when and how often a block of data is going to be accessed by application 2202.

An example scenario depicting an application 2202 accessing a block of data from storage 2212 is shown in FIG. 23. The numbered circles show the steps involved in the process of accessing a block of data. These steps are explained below. First, application 2202 uses an API of the file system or operating system 2204 to access a block of data; operating system 2204 invokes an equivalent API of file system client 2206 to access the same. Second, file system client 2206 tries to find whether the data exists in its local memory buffers dedicated for this purpose. If found, steps (3) through (7) below are skipped. Third, file system client 2206 sends a command to retrieve the data from block server 2210. Fourth, block server 2210 sends a read command to storage 2212 to read the block of data from the storage. Fifth, storage 2212 returns the block of data to block server 2210 after reading it from the storage. Sixth, block server 2210 returns the block of data to file system client 2206. Seventh, file system client 2206 saves the data in a memory buffer in memory 2208 for any future access. Eighth, file system client 2206 returns the requested data to the application 2202.
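
The eight steps above can be sketched as follows. This is an illustrative model only; the class and method names (`Storage.read`, `BlockServer.fetch`, `FileSystemClient.read_block`) are assumptions and do not correspond to any API named in this disclosure.

```python
class Storage:
    """Stand-in for storage 2212: a slow backing store of blocks."""

    def __init__(self, blocks):
        self.blocks = blocks
        self.reads = 0

    def read(self, block_id):
        self.reads += 1                      # count slow-path accesses
        return self.blocks[block_id]         # step 5: return block to server


class BlockServer:
    """Stand-in for block server 2210."""

    def __init__(self, storage):
        self.storage = storage

    def fetch(self, block_id):
        data = self.storage.read(block_id)   # step 4: read command to storage
        return data                          # step 6: return block to client


class FileSystemClient:
    """Stand-in for file system client 2206 with buffers in memory 2208."""

    def __init__(self, server):
        self.server = server
        self.buffers = {}

    def read_block(self, block_id):          # step 1: invoked via the OS API
        if block_id in self.buffers:         # step 2: look in local buffers
            return self.buffers[block_id]    # step 8 (steps 3-7 skipped)
        data = self.server.fetch(block_id)   # step 3: ask the block server
        self.buffers[block_id] = data        # step 7: keep for future access
        return data                          # step 8: return to application


storage = Storage({7: b"block seven"})
client = FileSystemClient(BlockServer(storage))
first = client.read_block(7)    # goes all the way to storage
second = client.read_block(7)   # satisfied from the local buffer
```

Only the first call reaches storage; the second is answered from the buffer saved in step seven.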

In the methods and systems disclosed herein, in order to address performance requirements related to data access by the newer class of applications in the area of NoSQL and BigData, it is proposed that the components in the data block access path, comprising operating system 2204, file system client 2206, memory 2208, block server 2210 and storage 2212, be controlled by any application 2202. Namely, we propose the following. First, enable operating system 2204 to provide an additional API to allow applications to control file system client 2206. Second, enhance file system client 2206 to support the following: (a) allow application 2202 to create a dedicated pool of memory in memory 2208 for a particular file or volume, in the sense that a file or volume will have a dedicated pool of memory buffers to hold data specific to it, which are not shared or removed for the purposes of other files or volumes; (b) allow application 2202 to create a dedicated pool of memory in memory 2208 for a particular session with a file or volume, such that two independent sessions with a file or volume will have independent memory buffers to hold their data. As an example, a critically important file session may have a large number of memory buffers in memory 2208, so that the session can take advantage of more data being present for quicker and more frequent access, whereas a second session with the same file may be assigned very few buffers and hence might incur more delay and reuse of its buffers to access various parts of the file; (c) allow application 2202 to create an extended pool of buffers beyond memory 2208 across other hosts or block server 2210 for quicker access. This enables blocks of data to be kept in memory 2208 of other hosts as well as in any memory 2402 present in the file or block server 2210; (d) allow application 2202 to make any block of data more persistent in memory 2208 relative to other blocks of data for a file, volume or session. This allows an application to pick and choose a block of data to be always available for immediate access and not let operating system 2204 or file system client 2206 evict it based on their own eviction policies; and (e) allow application 2202 to make any block of data less persistent in memory 2208 relative to other blocks of data for a file, volume or session. This allows an application to let operating system 2204 and file system client 2206 know that they may evict and reuse the buffer of the data block as and when they choose to. This helps in retaining other normal blocks of data for a longer period of time. Third, enable block server 2210 to host application specific modules in the form of an application container 2400, as shown in FIG. 24, with the following capabilities: (a) enable application container 2400 to fetch blocks of data of interest to application 2202 ahead of time and store them in local memory 2402 for later quick access, avoiding the latency penalty associated with storage 2212; and (b) enable storing of evicted pages from memory 2208 of hosts in local memory 2402 for any later access by application 2202.
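
Items (2)(d) and (2)(e) above can be illustrated with a small buffer pool in which the application, rather than a generic policy alone, influences eviction. This is a sketch under stated assumptions: the class `AppControlledPool` and its methods `pin` and `mark_disposable` are hypothetical names, and LRU is used only as the fallback policy. The evicted blocks are returned to the caller so that they could, for example, be forwarded to memory 2402 as in item (3)(b).

```python
from collections import OrderedDict


class AppControlledPool:
    """Buffer pool for one file, volume or session (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # least recently used first
        self.pinned = set()           # item (d): more persistent blocks
        self.disposable = set()       # item (e): less persistent blocks

    def pin(self, block_id):
        self.pinned.add(block_id)
        self.disposable.discard(block_id)

    def mark_disposable(self, block_id):
        self.disposable.add(block_id)
        self.pinned.discard(block_id)

    def get(self, block_id):
        if block_id not in self.blocks:
            return None
        self.blocks.move_to_end(block_id)
        return self.blocks[block_id]

    def put(self, block_id, data):
        """Store a block; return the list of (block_id, data) evicted."""
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        evicted = []
        while len(self.blocks) > self.capacity:
            victim = self._pick_victim()
            if victim is None:        # everything is pinned: allow overflow
                break
            evicted.append((victim, self.blocks.pop(victim)))
            self.disposable.discard(victim)
        return evicted

    def _pick_victim(self):
        # Prefer blocks the application marked disposable, oldest first.
        for block_id in self.blocks:
            if block_id in self.disposable:
                return block_id
        # Otherwise fall back to LRU order, skipping pinned blocks.
        for block_id in self.blocks:
            if block_id not in self.pinned:
                return block_id
        return None


pool = AppControlledPool(capacity=3)
for name in ("a", "b", "c"):
    pool.put(name, name.upper())
pool.pin("a")                      # keep "a" regardless of age
pool.mark_disposable("c")          # "c" may be reused first
evicted_first = pool.put("d", "D")
evicted_second = pool.put("e", "E")
```

A dedicated pool per file or per session, as in items (2)(a) and (2)(b), could be modeled by keeping one such pool object per file or session identifier.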

The application driven feature of (2)(c) above needs further explanation. There are two scenarios. The first involves a block of data being retrieved from the memory of block server 2210. The other involves retrieving the block from another host. Assuming the exact same block of data has been read from storage 2212 by two hosts (H1) and (H2), the methods and systems disclosed herein provide a system such as depicted in FIG. 25. When a block of data is noticed to be present in another host (H2), it is retrieved directly from that host by file system client 2206 instead of asking block server 2210 to retrieve it from storage 2212, which would be slower and incur higher latency.

In embodiments, if file system client 2206 decides to evict a block of data from (D1) in order to store a more important block of data in its place, file system client 2206 could send the evicted block of data to file system client 2206′ to be stored in memory 2208′ on its behalf.
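
The two behaviors just described, retrieving a block from a peer host and handing an evicted block to a peer, can be sketched as below. The shared `directory` mapping a block to the host believed to hold it is an assumption made for illustration; the disclosure does not specify how a block is "noticed" to be present in another host.

```python
class Host:
    """Hypothetical host with a file system client and local memory."""

    def __init__(self, name, directory, block_server):
        self.name = name
        self.memory = {}                  # stand-in for memory 2208
        self.directory = directory        # block_id -> Host holding a copy
        self.block_server = block_server  # callable: slow path to storage

    def read(self, block_id):
        """Return (data, source), where source names where the data came from."""
        if block_id in self.memory:
            return self.memory[block_id], "local"
        peer = self.directory.get(block_id)
        if peer is not None and peer is not self and block_id in peer.memory:
            data, source = peer.memory[block_id], "peer"
        else:
            data, source = self.block_server(block_id), "block-server"
        self.memory[block_id] = data
        self.directory.setdefault(block_id, self)
        return data, source

    def evict_to(self, block_id, peer):
        """Hand an evicted block to a peer instead of discarding it."""
        peer.memory[block_id] = self.memory.pop(block_id)
        self.directory[block_id] = peer


server_calls = []


def block_server(block_id):
    server_calls.append(block_id)         # record each slow-path access
    return f"data-{block_id}"


directory = {}
h1 = Host("H1", directory, block_server)
h2 = Host("H2", directory, block_server)

from_server = h2.read(5)    # H2 reads through the block server
from_peer = h1.read(5)      # H1 finds the block in H2's memory
h1.evict_to(5, h2)          # H1 frees its buffer; H2 keeps the copy
after_evict = h1.read(5)    # H1 gets it back from H2, not from storage
```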

It should be noted that the abovementioned techniques can be applied to achieving fast failover in case of failure(s) of hosts. Furthermore, the caching techniques described above, especially those pertaining to RAM, can be used to achieve failover with a warm cache. FIG. 25 shows an example of a fast failover system with a warm cache. The end result is that, during a failure of a node, the end application on a new node does not have to wait for the cache (in RAM) to be warmed and thereby avoids a period of lower application performance.

Provided herein is a system and method with a processor and a file server with an application specific module to control the storage access according to the application's needs.

Also provided herein is a system and method with a processor and a data (constituting blocks of fixed size bytes, similar or different objects with variable number of bytes) storage enabling an application specific module to control the storage access according to the application's needs.

Also provided herein is a system and method which retrieves a stale file or storage data block, previously maintained for the purposes of an application's use, from a host's memory and/or its temporary or permanent storage element and stores it in another host's memory and/or its temporary or permanent storage element, for the purposes of use by the application at a later time.

Also provided herein is a system and method which retrieves any file or storage data block, previously maintained for the purposes of an application's use, from a host's memory and/or its temporary or permanent storage element and stores it in another host's memory and/or its temporary or permanent storage element, for the purposes of use by the application at a later time.

Also provided herein is a system and method which utilizes memory and/or its temporary or permanent storage element of a host to store any file or storage data block which would be subsequently accessed by an application running in another host for the purposes of reducing latency of data access.

File or storage data blocks, previously maintained for the purposes of an application's use, from a host's memory and/or its temporary or permanent storage element, may be stored in another host's memory and/or its temporary or permanent storage element, for the purposes of use by the application at a later time.

Also provided herein is a mechanism for transferring a file or storage data block, previously maintained for the purposes of an application's use, from a host's memory and/or its temporary or permanent storage element to another host over a network.

In accordance with various exemplary and non-limiting embodiments, there is disclosed a device comprising a converged input/output controller that includes a physical target storage media controller, a physical network interface controller and a gateway between the storage media controller and the network interface controller, wherein the gateway provides a direct connection for storage traffic and network traffic between the storage media controller and the network interface controller.

In accordance with some embodiments, the device may further comprise a virtual storage interface that presents storage media controlled by the storage media controller as locally attached storage, regardless of the location of the storage media. In accordance with yet other embodiments, the device may further comprise a virtual storage interface that presents storage media controlled by the storage media controller as locally attached storage, regardless of the type of the storage media. In accordance with yet other embodiments, the device may further comprise a virtual storage interface that facilitates dynamic provisioning of the storage media, wherein the physical storage may be either local or remote.

In accordance with yet other embodiments, the device may furthercomprise a virtual network interface that facilitates dynamicprovisioning of the storage media, wherein the physical storage may beeither local or remote. In accordance with yet other embodiments, thedevice may be adapted to be installed as a controller card on a hostcomputing system, in particular, wherein the gateway operates withoutintervention by the operating system of the host computing system.

In accordance with yet other embodiments, the device may include at least one field programmable gate array providing at least one of the storage functions and the network functions of the device. In accordance with yet other embodiments, the device may be configured as a network-deployed switch. In accordance with yet other embodiments, the device may further comprise a functional component of the device for translating storage media instructions between a first protocol and at least one other protocol.

With reference to FIG. 26, there is illustrated an exemplary and non-limiting method of virtualization of a storage device. First, at step 2600, a physical storage device that responds to instructions in a first storage protocol is accessed. Next, at step 2602, instructions are translated between the first storage protocol and a second storage protocol. Lastly, at step 2604, using the second protocol, the physical storage device is presented to an operating system, such that the storage of the physical storage device can be dynamically provisioned, whether the physical storage device is local or remote to a host computing system that uses the operating system.
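
The three steps of FIG. 26 can be pictured with a toy translation layer. Everything in this sketch is hypothetical: the command names (`READ_SECTORS`, `WRITE_SECTORS`, `read`, `write`) are invented placeholders and are not the encodings of SATA, NVMe or any other real protocol. The point is only the shape of the technique: a front end accepts commands in one vocabulary and re-issues them to the device in another.

```python
SECTOR = 512  # assumed block size for this sketch


class LegacyDisk:
    """Step 2600: a device that responds to a (made-up) first protocol."""

    def __init__(self, sectors):
        self.data = bytearray(sectors * SECTOR)

    def execute(self, command):
        start = command["lba"] * SECTOR
        end = start + command["count"] * SECTOR
        if command["op"] == "READ_SECTORS":
            return bytes(self.data[start:end])
        if command["op"] == "WRITE_SECTORS":
            self.data[start:end] = command["payload"]
            return b""
        raise ValueError(f"unsupported first-protocol op: {command['op']}")


class VirtualNamespace:
    """Steps 2602-2604: translate and present a (made-up) second protocol."""

    _OPS = {"read": "READ_SECTORS", "write": "WRITE_SECTORS"}

    def __init__(self, device):
        self.device = device  # could equally be a proxy for a remote device

    def submit(self, opcode, start_block, block_count, payload=b""):
        if opcode not in self._OPS:
            raise ValueError(f"unsupported second-protocol op: {opcode}")
        if opcode == "write" and len(payload) != block_count * SECTOR:
            raise ValueError("payload length does not match block count")
        # Translate the second-protocol request into a first-protocol command.
        return self.device.execute({
            "op": self._OPS[opcode],
            "lba": start_block,
            "count": block_count,
            "payload": payload,
        })


namespace = VirtualNamespace(LegacyDisk(sectors=8))
namespace.submit("write", 2, 1, b"x" * SECTOR)
read_back = namespace.submit("read", 2, 1)
untouched = namespace.submit("read", 0, 1)
```

Because the caller only sees `VirtualNamespace`, the object behind it could be local or reached over a network without the caller changing, which mirrors the local-or-remote presentation described at step 2604.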

In accordance with various embodiments, the first protocol is at least one of a SATA protocol, an NVMe protocol, a SAS protocol, an iSCSI protocol, a fiber channel protocol and a fiber channel over Ethernet protocol. In other embodiments, the second protocol is an NVMe protocol.

In some embodiments, the method may further comprise providing an interface between an operating system and a device that performs the translation of instructions between the first and second storage protocols and/or providing an NVMe over Ethernet connection between the device that performs the translation of instructions and a remote, network-deployed storage device.

With reference to FIG. 27, there is illustrated an exemplary and non-limiting method of facilitating migration of at least one of an application and a container. First, at step 2700, there is provided a converged storage and networking controller, wherein a gateway provides a connection for network and storage traffic between a storage component and a networking component of the device without intervention of the operating system of a host computer. Next, at step 2702, the at least one application or container is mapped to a target physical storage device that is controlled by the converged storage and networking controller, such that the application or container can access the target physical storage, without intervention of the operating system of the host system to which the target physical storage is attached, when the application or container is moved to another computing system.

In accordance with various embodiments, the migration is of a Linux container or a scaleout application.

In accordance with yet other embodiments, the target physical storage is a network-deployed storage device that uses at least one of an iSCSI protocol, a fiber channel protocol and a fiber channel over Ethernet protocol. In yet other embodiments, the target physical storage is a disk attached storage device that uses at least one of a SAS protocol, a SATA protocol and an NVMe protocol.

With reference to FIG. 28, there is illustrated an exemplary and non-limiting method of providing quality of service (QoS) for a network. First, at step 2800, there is provided a converged storage and networking controller, wherein a gateway provides a connection for network and storage traffic between a storage component and a networking component of the device without intervention of the operating system of a host computer. Next, at step 2802, without intervention of the operating system of a host computer, there is managed at least one quality of service (QoS) parameter related to a network in the data path of which the storage and networking controller is deployed, such managing being based on at least one of the storage traffic and the network traffic that is handled by the converged storage and networking controller.

While only a few embodiments of the present disclosure have been shown and described, it will be obvious to those skilled in the art that many changes and modifications may be made thereunto without departing from the spirit and scope of the present disclosure as described in the following claims. All patent applications and patents, both foreign and domestic, and all other publications referenced herein are incorporated herein in their entireties to the full extent permitted by law.

The methods and systems described herein may be deployed in part or in whole through a machine that executes computer software, program codes, and/or instructions on a processor. The present disclosure may be implemented as a method on the machine, as a system or apparatus as part of or in relation to the machine, or as a computer program product embodied in a computer readable medium executing on one or more of the machines. In embodiments, the processor may be part of a server, cloud server, client, network infrastructure, mobile computing platform, stationary computing platform, or other computing platform. A processor may be any kind of computational or processing device capable of executing program instructions, codes, binary instructions and the like. The processor may be or may include a signal processor, digital processor, embedded processor, microprocessor or any variant such as a co-processor (math co-processor, graphic co-processor, communication co-processor and the like) and the like that may directly or indirectly facilitate execution of program code or program instructions stored thereon. In addition, the processor may enable execution of multiple programs, threads, and codes. The threads may be executed simultaneously to enhance the performance of the processor and to facilitate simultaneous operations of the application. By way of implementation, methods, program codes, program instructions and the like described herein may be implemented in one or more threads. The thread may spawn other threads that may have assigned priorities associated with them; the processor may execute these threads based on priority or any other order based on instructions provided in the program code. The processor, or any machine utilizing one, may include non-transitory memory that stores methods, codes, instructions and programs as described herein and elsewhere. The processor may access a non-transitory storage medium through an interface that may store methods, codes, and instructions as described herein and elsewhere. The storage medium associated with the processor for storing methods, programs, codes, program instructions or other type of instructions capable of being executed by the computing or processing device may include but may not be limited to one or more of a CD-ROM, DVD, memory, hard disk, flash drive, RAM, ROM, cache and the like.

A processor may include one or more cores that may enhance speed and performance of a multiprocessor. In embodiments, the processor may be a dual core processor, quad core processor, other chip-level multiprocessor and the like that combine two or more independent cores (called a die).

The methods and systems described herein may be deployed in part or in whole through a machine that executes computer software on a server, client, firewall, gateway, hub, router, or other such computer and/or networking hardware. The software program may be associated with a server that may include a file server, print server, domain server, internet server, intranet server, cloud server, and other variants such as secondary server, host server, distributed server and the like. The server may include one or more of memories, processors, computer readable media, storage media, ports (physical and virtual), communication devices, and interfaces capable of accessing other servers, clients, machines, and devices through a wired or a wireless medium, and the like. The methods, programs, or codes as described herein and elsewhere may be executed by the server. In addition, other devices required for execution of methods as described in this application may be considered as a part of the infrastructure associated with the server.

The server may provide an interface to other devices including, without limitation, clients, other servers, printers, database servers, print servers, file servers, communication servers, distributed servers, social networks, and the like. Additionally, this coupling and/or connection may facilitate remote execution of program across the network. The networking of some or all of these devices may facilitate parallel processing of a program or method at one or more location without deviating from the scope of the disclosure. In addition, any of the devices attached to the server through an interface may include at least one storage medium capable of storing methods, programs, code and/or instructions. A central repository may provide program instructions to be executed on different devices. In this implementation, the remote repository may act as a storage medium for program code, instructions, and programs.

The software program may be associated with a client that may include a file client, print client, domain client, internet client, intranet client and other variants such as secondary client, host client, distributed client and the like. The client may include one or more of memories, processors, computer readable media, storage media, ports (physical and virtual), communication devices, and interfaces capable of accessing other clients, servers, machines, and devices through a wired or a wireless medium, and the like. The methods, programs, or codes as described herein and elsewhere may be executed by the client. In addition, other devices required for execution of methods as described in this application may be considered as a part of the infrastructure associated with the client.

The client may provide an interface to other devices including, without limitation, servers, other clients, printers, database servers, print servers, file servers, communication servers, distributed servers and the like. Additionally, this coupling and/or connection may facilitate remote execution of program across the network. The networking of some or all of these devices may facilitate parallel processing of a program or method at one or more location without deviating from the scope of the disclosure. In addition, any of the devices attached to the client through an interface may include at least one storage medium capable of storing methods, programs, applications, code and/or instructions. A central repository may provide program instructions to be executed on different devices. In this implementation, the remote repository may act as a storage medium for program code, instructions, and programs.

The methods and systems described herein may be deployed in part or in whole through network infrastructures. The network infrastructure may include elements such as computing devices, servers, routers, hubs, firewalls, clients, personal computers, communication devices, routing devices and other active and passive devices, modules and/or components as known in the art. The computing and/or non-computing device(s) associated with the network infrastructure may include, apart from other components, a storage medium such as flash memory, buffer, stack, RAM, ROM and the like. The processes, methods, program codes, instructions described herein and elsewhere may be executed by one or more of the network infrastructural elements. The methods and systems described herein may be adapted for use with any kind of private, community, or hybrid cloud computing network or cloud computing environment, including those which involve features of software as a service (SaaS), platform as a service (PaaS), and/or infrastructure as a service (IaaS).

The methods, program codes, and instructions described herein and elsewhere may be implemented on a cellular network having multiple cells. The cellular network may either be a frequency division multiple access (FDMA) network or a code division multiple access (CDMA) network. The cellular network may include mobile devices, cell sites, base stations, repeaters, antennas, towers, and the like. The cell network may be a GSM, GPRS, 3G, EVDO, mesh, or other network type.

The methods, program codes, and instructions described herein and elsewhere may be implemented on or through mobile devices. The mobile devices may include navigation devices, cell phones, mobile phones, mobile personal digital assistants, laptops, palmtops, netbooks, pagers, electronic books readers, music players and the like. These devices may include, apart from other components, a storage medium such as a flash memory, buffer, RAM, ROM and one or more computing devices. The computing devices associated with mobile devices may be enabled to execute program codes, methods, and instructions stored thereon. Alternatively, the mobile devices may be configured to execute instructions in collaboration with other devices. The mobile devices may communicate with base stations interfaced with servers and configured to execute program codes. The mobile devices may communicate on a peer-to-peer network, mesh network, or other communications network. The program code may be stored on the storage medium associated with the server and executed by a computing device embedded within the server. The base station may include a computing device and a storage medium. The storage device may store program codes and instructions executed by the computing devices associated with the base station.

The computer software, program codes, and/or instructions may be stored and/or accessed on machine readable media that may include: computer components, devices, and recording media that retain digital data used for computing for some interval of time; semiconductor storage known as random access memory (RAM); mass storage typically for more permanent storage, such as optical discs, forms of magnetic storage like hard disks, tapes, drums, cards and other types; processor registers, cache memory, volatile memory, non-volatile memory; optical storage such as CD, DVD; removable media such as flash memory (e.g. USB sticks or keys), floppy disks, magnetic tape, paper tape, punch cards, standalone RAM disks, Zip drives, removable mass storage, off-line, and the like; other computer memory such as dynamic memory, static memory, read/write storage, mutable storage, read only, random access, sequential access, location addressable, file addressable, content addressable, network attached storage, storage area network, bar codes, magnetic ink, and the like.

The methods and systems described herein may transform physical and/or intangible items from one state to another. The methods and systems described herein may also transform data representing physical and/or intangible items from one state to another.

The elements described and depicted herein, including in flow charts and block diagrams throughout the figures, imply logical boundaries between the elements. However, according to software or hardware engineering practices, the depicted elements and the functions thereof may be implemented on machines through computer executable media having a processor capable of executing program instructions stored thereon as a monolithic software structure, as standalone software modules, or as modules that employ external routines, code, services, and so forth, or any combination of these, and all such implementations may be within the scope of the present disclosure. Examples of such machines may include, but may not be limited to, personal digital assistants, laptops, personal computers, mobile phones, other handheld computing devices, medical equipment, wired or wireless communication devices, transducers, chips, calculators, satellites, tablet PCs, electronic books, gadgets, electronic devices, devices having artificial intelligence, computing devices, networking equipment, servers, routers and the like. Furthermore, the elements depicted in the flow chart and block diagrams or any other logical component may be implemented on a machine capable of executing program instructions. Thus, while the foregoing drawings and descriptions set forth functional aspects of the disclosed systems, no particular arrangement of software for implementing these functional aspects should be inferred from these descriptions unless explicitly stated or otherwise clear from the context. Similarly, it will be appreciated that the various steps identified and described above may be varied, and that the order of steps may be adapted to particular applications of the techniques disclosed herein. All such variations and modifications are intended to fall within the scope of this disclosure. As such, the depiction and/or description of an order for various steps should not be understood to require a particular order of execution for those steps, unless required by a particular application, or explicitly stated or otherwise clear from the context.

The methods and/or processes described above, and steps associated therewith, may be realized in hardware, software or any combination of hardware and software suitable for a particular application. The hardware may include a general-purpose computer and/or dedicated computing device or specific computing device or particular aspect or component of a specific computing device. The processes may be realized in one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors or other programmable device, along with internal and/or external memory. The processes may also, or instead, be embodied in an application specific integrated circuit, a programmable gate array, programmable array logic, or any other device or combination of devices that may be configured to process electronic signals. It will further be appreciated that one or more of the processes may be realized as computer executable code capable of being executed on a machine-readable medium.

The computer executable code may be created using a structured programming language such as C, an object oriented programming language such as C++, or any other high-level or low-level programming language (including assembly languages, hardware description languages, and database programming languages and technologies) that may be stored, compiled or interpreted to run on one of the above devices, as well as heterogeneous combinations of processors, processor architectures, or combinations of different hardware and software, or any other machine capable of executing program instructions.

Thus, in one aspect, methods described above and combinations thereof may be embodied in computer executable code that, when executing on one or more computing devices, performs the steps thereof. In another aspect, the methods may be embodied in systems that perform the steps thereof, and may be distributed across devices in a number of ways, or all of the functionality may be integrated into a dedicated, standalone device or other hardware. In another aspect, the means for performing the steps associated with the processes described above may include any of the hardware and/or software described above. All such permutations and combinations are intended to fall within the scope of the present disclosure.

While the disclosure has been disclosed in connection with the preferred embodiments shown and described in detail, various modifications and improvements thereon will become readily apparent to those skilled in the art. Accordingly, the spirit and scope of the present disclosure are not to be limited by the foregoing examples, but are to be understood in the broadest sense allowable by law.

The use of the terms “a” and “an” and “the” and similar referents in the context of describing the disclosure (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to,”) unless otherwise noted. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the disclosure and does not pose a limitation on the scope of the disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the disclosure.

While the foregoing written description enables one of ordinary skill to make and use what is considered presently to be the best mode thereof, those of ordinary skill will understand and appreciate the existence of variations, combinations, and equivalents of the specific embodiment, method, and examples herein. The disclosure should therefore not be limited by the above described embodiment, method, and examples, but by all embodiments and methods within the scope and spirit of the disclosure.

All documents referenced herein are hereby incorporated by reference.

What is claimed is:
1. A converged controller for interfacing a set of sources and a set of targets with credit-based flow control, the controller comprising: a plurality of source-oriented queues, each source-oriented queue connected to a different source of the set of sources; a plurality of target-oriented queues, each target-oriented queue connected to a different target of the set of targets and configured with a number of target access credits; and a multiplexer for selectively coupling a source-oriented queue of the plurality of source-oriented queues to at least one target-oriented queue of the plurality of target-oriented queues, wherein the coupling enables a number of data accesses between a source connected to the source-oriented queue and a subset of the set of targets connected to the at least one target-oriented queue according to the credit-based flow control; wherein the credit-based flow control limits the number of data accesses according to a number of credits allocated to the source connected to the source-oriented queue; and wherein the number of credits is computed from the number of target access credits of the at least one target-oriented queue.
2. The controller of claim 1, wherein the number of credits allocated to the source connected to the source-oriented queue is less than or equal to a depth of the source-oriented queue.
3. The controller of claim 2, wherein the depth of each of the plurality of source-oriented queues is less than or equal to a total depth of all the plurality of target-oriented queues.
4. The controller of claim 3, wherein at least one of the set of targets is a direct connected data storage.
5. The controller of claim 3, wherein at least one of the set of sources is an Ethernet device.
6. The controller of claim 1, wherein the number of credits allocated to the source connected to the source-oriented queue is based at least in part on a size of command buffers of the subset of targets.
7. The controller of claim 3, wherein each of the plurality of target-oriented queues is sized according to a size of a command buffer of a connected target.
8. The controller of claim 1, wherein credits are allocated to the source in response to a data transfer request from the source.
9. The controller of claim 1, further comprising a physical storage media controller, a physical network interface controller and a direct connection therebetween for performing data accesses between the source connected to the source-oriented queue and the subset of targets connected to the at least one target-oriented queue.
10. A method for source-oriented credit-based scheduling of data flow: providing a set of target access credits to a plurality of target-oriented queues for accessing target resources; mapping with a multiplexer a source-oriented queue of a plurality of source-oriented queues to a portion of the plurality of target-oriented queues; providing a set of source access credits for the source-oriented queue of the plurality of source-oriented queues responsive to a request from at least one of a plurality of source resources connected to the plurality of source-oriented queues to access the target resources; and limiting a maximum number of source access credits for the source-oriented queue of the plurality of source-oriented queues based on a total count of target access credits provided to the portion of the plurality of target-oriented queues.
11. The method of claim 10, wherein providing the set of target access credits further comprises limiting the set of target access credits to a size that is less than or equal to a total depth of the plurality of target-oriented queues.
12. The method of claim 10, wherein at least one of the target resources is a direct connected data storage.
13. The method of claim 10, wherein at least one of the plurality of source resources is an Ethernet device.
14. The method of claim 10, wherein limiting the maximum number of source access credits further comprises sizing a depth of the source-oriented queue to the maximum number of source access credits.
15. A storage control system comprising: a plurality of source-oriented queues that each provide access credits to network-remote sources requesting access to storage resources controlled by a physical storage controller portion of a converged network-storage controller, wherein each of the network-remote sources is a distinct instance of the converged network-storage controller; a plurality of target-oriented queues, wherein each target-oriented queue controls access to a local, physical storage resource by limiting a count of target access credits permitted for each local physical storage resource; and a multiplexer for mapping the plurality of source-oriented queues to the plurality of target-oriented queues, wherein a maximum number of access credits permitted for each of the plurality of source-oriented queues is limited by the multiplexer to no more than a total number of target access credits available from the plurality of target-oriented queues with which each source queue of the plurality of source-oriented queues is multiplexed.
16. The system of claim 15, wherein access bandwidth and access latency are guaranteed independent of a number of local, physical storage resources.
17. The system of claim 15, wherein access bandwidth and access latency are guaranteed independent of a number of converged network-storage controllers.
18. A method of guaranteeing predictable access latency in a network-distributed storage system, comprising: multiplexing a plurality of source-oriented queues to a plurality of target-oriented queues; and limiting a maximum size of each of the plurality of source-oriented queues to no more than a combined size of the plurality of target-oriented queues with which the plurality of source-oriented queues are multiplexed.
19. The method of claim 18, further comprising allocating credits to a source coupled to a multiplexed source-oriented queue in response to a data transfer request from the source.
20. The method of claim 18, further comprising limiting a count of credits allocated to a source coupled to at least one of the plurality of source-oriented queues to the maximum size of each of the plurality of source-oriented queues for a credit-based flow control of data transfer.
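
By way of non-limiting illustration only, the credit limit recited in the claims above may be sketched in Python as follows. The sketch assumes that the credits available to a source are capped by the smaller of (a) the depth of its source-oriented queue and (b) the total target access credits of the target-oriented queues to which the multiplexer has coupled it. The class and method names (`TargetQueue`, `SourceQueue`, `Multiplexer`, `max_credits`, `request`, `complete`) are hypothetical and do not appear in the disclosure, and returning credits on completion is likewise an assumption of the sketch rather than a recited step.

```python
from dataclasses import dataclass


@dataclass
class TargetQueue:
    """Target-oriented queue. Its credit count is assumed to reflect
    the size of the connected target's command buffer."""
    name: str
    access_credits: int


@dataclass
class SourceQueue:
    """Source-oriented queue with a fixed depth and a running count of
    credits currently held by its source."""
    name: str
    depth: int
    outstanding: int = 0


class Multiplexer:
    """Hypothetical multiplexer coupling each source-oriented queue to a
    subset of the target-oriented queues."""

    def __init__(self):
        self._mapping = {}  # source queue name -> list of TargetQueue

    def couple(self, source, targets):
        self._mapping[source.name] = list(targets)

    def max_credits(self, source):
        # Cap: no more than the total target access credits of the
        # coupled target queues, and no more than the source queue depth.
        targets = self._mapping.get(source.name, [])
        total_target_credits = sum(t.access_credits for t in targets)
        return min(source.depth, total_target_credits)

    def request(self, source, wanted):
        # Credits are granted in response to a data transfer request,
        # never exceeding the cap computed above.
        available = self.max_credits(source) - source.outstanding
        granted = max(0, min(wanted, available))
        source.outstanding += granted
        return granted

    def complete(self, source, count):
        # Assumption of this sketch: finished accesses return credits.
        source.outstanding -= min(count, source.outstanding)


if __name__ == "__main__":
    mux = Multiplexer()
    drives = [TargetQueue("t0", 4), TargetQueue("t1", 8)]
    remote = SourceQueue("s0", depth=16)
    mux.couple(remote, drives)
    print(mux.max_credits(remote))   # cap for this source
    print(mux.request(remote, 10))   # credits granted for the request
```

Under these assumptions, a source can never hold more outstanding credits than its coupled targets can absorb, which is the property on which the claimed latency guarantee relies.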